2015-05-24 - Shakespeare Wasn’t a Programmer
There’s a line in one of William Shakespeare's plays that goes something like “A rose by any other name would smell as sweet”. The idea being that what you call something doesn’t impact what it is. A rose doesn’t change its smell because you call it something else. This is not the case in programming where something’s name determines what it is.
There are lots of named things in a program and the names chosen are used to give meaning to their usage within the program. A variable name describes what data it holds. A function name describes what it does. It’s possible to have identical variables with identical data that are differentiated because of their name. An account object may have two floating point values, one named “Credits” and one named “Debits”. They could be stored using the same pattern and contain the same value but because of their names we know they are different.
Going even further, without a type a variable is just a set of bits. It’s possible to take a floating point value and treat it as an integer or as a series of characters. The type is required to give meaning to the data stored, but that type can be changed and the meaning of the data along with it.
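As a rough illustration of the same bits taking on different meanings, here is a small Python sketch. The function names are made up for this example, and it assumes 4 byte values stored most significant byte first.

```python
import struct


def float_bits_as_int(value: float) -> int:
    """Pack a value as a 4 byte float, then read the same bytes as an unsigned integer."""
    raw = struct.pack(">f", value)
    return struct.unpack(">I", raw)[0]


def int_bits_as_text(value: int) -> str:
    """Treat the 4 bytes of an integer as a series of ASCII characters."""
    return value.to_bytes(4, "big").decode("ascii")


print(hex(float_bits_as_int(1.0)))   # 0x3f800000
print(int_bits_as_text(0x48692121))  # Hi!!
```

Nothing about the stored bits changes in either call. Only the type used to read them does.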
That’s the problem with working in a medium where you’re making everything.
2015-05-03 - Branching Storylines
I am a big fan of stories in games. Some of my favourite games are so because of their stories. That’s why it bugs me that there’s a trend in games towards having very open stories. Games where the player is given a lot of control over what happens. They can choose to go here or to not go there. They can choose to save the village or to burn it down. The player is partially responsible for writing the story and that makes it difficult to have a really strong story.
The problem is that these choices make alternatives. A character could be dead in one version of the story but alive in another. The player could be evil in one story while they could be a saint in another. It’s difficult to have a strong story when everything in it is so fluid. Now I do understand the reasoning for this. Players like to feel that their actions have impact. That they’re doing something and not just following along.
Personally I would prefer it if the choices were limited and fairly obvious. That way the player still has impact in the story but at the same time there aren’t too many alternatives. That way it’s easier to polish the story and make sure it’s solid while at the same time allowing the player to go through all the possible stories.
That still leaves the problem of endings though. It’s difficult to make a sequel to a game with multiple endings without picking one or just making up a new one that’s a combination of all of them. As I said it’s hard to have a strong story when everything is so fluid.
2015-04-19 - Character Encoding: Endianness
Data is stored on computers as a series of bytes and the order in which these bytes are saved is based on the endianness of the system. Big endian systems store the most significant byte (MSB) first while little endian systems store the least significant byte (LSB) first. For example consider the number 305,419,896 which is 0x12345678 in hex. Every pair of hex digits is a byte, so in a big endian system the byte 0x12 would be saved first while in a little endian system 0x78 would be saved first.
Endianness | Byte 0 (Low Address) | Byte 1 | Byte 2 | Byte 3 (High Address) |
---|---|---|---|---|
Big Endian | 0x12 | 0x34 | 0x56 | 0x78 |
Little Endian | 0x78 | 0x56 | 0x34 | 0x12 |
If the same system saves and loads the data then everything’s fine. If the value is saved on a little endian system but read on a big endian system, the reader would get the incorrect value of 2,018,915,346 (0x78563412). The same would happen going from a big endian system to a little endian system.
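The byte orders in the table and the misread value can be reproduced with a short Python sketch. The variable names are only for illustration.

```python
value = 0x12345678  # 305,419,896

# Lay the same number out in both byte orders.
big = value.to_bytes(4, "big")
little = value.to_bytes(4, "little")
print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Save as little endian, then read the bytes back as if they were big endian.
misread = int.from_bytes(little, "big")
print(misread)       # 2018915346

# Reading with the same byte order used for writing gives the original value.
assert int.from_bytes(little, "little") == value
```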
Endianness is not specific to character encodings but it is one of the places where it’s most noticeable, as text is commonly sent between computers. Because of this, programs are often designed to read and write both ways, so endianness is no longer a function of the computer being used but dependent on how the program saves the data.
UCS-2 and UCS-4 solve this problem using a Byte-Order-Mark (BOM). The character U+FEFF is placed at the start of the file to indicate the encoding and endianness of the file. U+FFFE is an invalid character, so if it shows up at the beginning of a file then the bytes were read in the wrong order and it can be assumed that the alternative endianness should be used. Big-endian is assumed if no BOM is present and the format is not otherwise specified.
BOM | Encoding |
---|---|
0xFEFF | UCS-2 Big Endian |
0xFFFE | UCS-2 Little Endian |
0x0000FEFF | UCS-4 Big Endian |
0xFFFE0000 | UCS-4 Little Endian |
The BOM allows Unicode text to identify its own characteristics so that there’s no external information required to display the data correctly.
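A minimal sketch of how a reader might use the table above, written in Python. The `detect_bom` function and the `BOMS` list are hypothetical names for this example. It assumes the 4 byte marks should be checked first, because the UCS-4 little endian BOM begins with the same two bytes as the UCS-2 little endian one.

```python
from typing import Optional

# Byte sequences as they appear at the start of the file, longest first.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 Big Endian"),
    (b"\xff\xfe\x00\x00", "UCS-4 Little Endian"),
    (b"\xfe\xff", "UCS-2 Big Endian"),
    (b"\xff\xfe", "UCS-2 Little Endian"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by the BOM, or None if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xfe\xff\x00A"))  # UCS-2 Big Endian
print(detect_bom(b"\xff\xfeA\x00"))  # UCS-2 Little Endian
print(detect_bom(b"plain text"))     # None
```

One caveat of this ordering: UCS-2 little endian text whose first character after the BOM is U+0000 would be reported as UCS-4 little endian, so a real reader may need more context than the first four bytes.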
2015-04-12 - Character Encoding: UCS-2/UCS-4
Extended ASCII encodings allow for a large number of characters to be displayed but require the use of multiple character sets within a single encoding. This means that a single value can map to multiple characters, which causes problems when transmitting data or when a single document needs characters from several sets. To solve this problem a universal character set called Unicode was created. The 2 byte Universal Character Set (UCS-2) uses two bytes to encode every character, which allows for a much larger number of possible characters. The first 256 characters are similar to the English Windows-1252 code page, and characters from a wide variety of other languages and symbols make up the rest. There are no character sets so every value corresponds to only one character. Unicode character codes use the format U+XXXX where U+ indicates that it’s a Unicode character and XXXX is the 4 digit hex value of the character.
Language | Range |
---|---|
English | U+0000 - U+00FF |
Cyrillic | U+0400 - U+052F |
Arabic | U+0600 - U+077F |
Greek | U+0370 - U+03FF |
Hebrew | U+0590 - U+05FF |
Chinese/Japanese/Korean | U+4E00 - U+9FFF |
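To make the U+XXXX notation concrete, here is a short Python sketch that labels a few characters from the ranges above and shows the two bytes UCS-2 would store for each. The helper names are invented for this example and big endian byte order is assumed.

```python
def code_point_label(ch: str) -> str:
    """Format a character's value in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def ucs2_big_endian(ch: str) -> bytes:
    """Encode one character as two bytes, most significant byte first."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("character does not fit in two bytes")
    return code_point.to_bytes(2, "big")


for ch in "AЖλ中":
    print(ch, code_point_label(ch), ucs2_big_endian(ch).hex())
# A U+0041 0041
# Ж U+0416 0416
# λ U+03BB 03bb
# 中 U+4E2D 4e2d
```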
As more and more characters were identified and added to the standard it became clear that 2 bytes were not enough. This led to a 4 byte Universal Character Set (UCS-4) to allow for even more characters. Characters in the range U+00000000 - U+0000FFFF are identical to UCS-2 and make up the Basic Multilingual Plane. Characters above U+0000FFFF make up the supplementary planes.
Plane | Range |
---|---|
Basic Multilingual Plane | U+0000 – U+FFFF |
Supplementary Multilingual Plane | U+10000 – U+1FFFF |
Supplementary Ideographic Plane | U+20000 – U+2FFFF |
Supplementary Special-purpose Plane | U+E0000 – U+EFFFF |
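Each plane in the table covers 0x10000 values, so the plane number can be read off by dropping the low 16 bits of the character's value. A small Python sketch of that idea, with hypothetical names, covering only the planes listed above:

```python
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-purpose Plane",
}


def plane_of(code_point: int) -> int:
    """Return the plane number: everything above the low 16 bits."""
    return code_point >> 16


def plane_name(code_point: int) -> str:
    """Look up the plane name, falling back to the bare number."""
    plane = plane_of(code_point)
    return PLANE_NAMES.get(plane, f"Plane {plane}")


print(plane_name(0x4E2D))   # Basic Multilingual Plane
print(plane_name(0x1F600))  # Supplementary Multilingual Plane
print(plane_name(0xE0001))  # Supplementary Special-purpose Plane
```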
These multi-byte character encodings allow for a vast number of characters to be encoded but the majority of these characters are not commonly used. This makes UCS-2 and especially UCS-4 space inefficient. There are also compatibility issues with earlier encoding schemes if they are incorrectly read as UCS-2 or UCS-4.
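Both problems can be seen in a quick Python sketch. Python has no codecs named UCS-2 or UCS-4, so this assumes its `utf-16-be` and `utf-32-be` codecs as stand-ins. For the plain English text used here they produce the same fixed 2 and 4 byte values.

```python
text = "Hello, world"  # 12 characters, all in the ASCII range

ascii_size = len(text.encode("ascii"))
ucs2_size = len(text.encode("utf-16-be"))  # stand-in for UCS-2
ucs4_size = len(text.encode("utf-32-be"))  # stand-in for UCS-4

print(ascii_size, ucs2_size, ucs4_size)  # 12 24 48

# Compatibility: two single byte characters misread as one 2 byte character.
misread = b"Hi".decode("utf-16-be")
print(len(misread), f"U+{ord(misread):04X}")  # 1 U+4869
```

The same text takes two to four times the space, and single byte data read as UCS-2 silently turns into unrelated characters rather than failing.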