2015-04-19 - Character Encoding: Endianness
Character Encoding
- Part 1 - ASCII
- Part 2 - Extended ASCII
- Part 3 - UCS-2/UCS-4
- Part 4 - Endianness
- Part 5 - UTF-8/UTF-16/UTF-32
- Part 6 - Conclusion
Data is stored on computers as a series of bytes, and the order in which those bytes are saved depends on the endianness of the system. Big endian systems store the most significant byte (MSB) first, while little endian systems store the least significant byte (LSB) first. For example, consider the number 305,419,896, which is 0x12345678 in hex. Every pair of hex digits is one byte, so a big endian system would save the byte 0x12 first while a little endian system would save 0x78 first.
Endianness | Byte 0 (Low Address) | Byte 1 | Byte 2 | Byte 3 (High Address) |
---|---|---|---|---|
Big Endian | 0x12 | 0x34 | 0x56 | 0x78 |
Little Endian | 0x78 | 0x56 | 0x34 | 0x12 |
If the same system saves and loads the data, then everything is fine. If the value is saved on a little endian system but read on a big endian system, the reader gets the incorrect value 2,018,915,346 (0x78563412). The same happens going from a big endian system to a little endian system.
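The byte orders and the misread value can be reproduced with a short Python sketch. The variable names are made up for illustration; `int.to_bytes` and `int.from_bytes` are standard library methods, and the separator argument to `bytes.hex` assumes Python 3.8 or later.

```python
value = 0x12345678  # 305,419,896

# Lay the same number out in memory order for each endianness.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Bytes written by a little endian system, interpreted as big endian.
misread = int.from_bytes(little, byteorder="big")
print(f"{misread:,}")   # 2,018,915,346
```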
Endianness is not specific to character encodings, but it is one of the places where it’s most noticeable, as text is commonly sent between computers. Because of this, programs are often designed to read and write both byte orders, so endianness is no longer a function of the computer being used but dependent on how the program saves the data.
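One way a program can take the host computer out of the picture is to fix the byte order when it serializes a value. The sketch below uses Python's `struct` module; the helper names `save_u32` and `load_u32` are hypothetical, and the choice of big endian for the saved form is only an assumption for the example.

```python
import struct
import sys

def save_u32(value: int) -> bytes:
    """Serialize a 32-bit unsigned integer as big endian on any host."""
    return struct.pack(">I", value)  # ">" forces big endian

def load_u32(data: bytes) -> int:
    """Read back a value written by save_u32."""
    return struct.unpack(">I", data)[0]

print(sys.byteorder)             # byte order of the machine running this
encoded = save_u32(0x12345678)
print(encoded.hex(" "))          # same bytes regardless of sys.byteorder
print(hex(load_u32(encoded)))
```

Using `"<I"` instead would fix the saved form to little endian; what matters is that the writer and the reader agree on the format rather than relying on their hardware.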
UCS-2 and UCS-4 solve this problem using a byte order mark (BOM). The character U+FEFF is placed at the start of the file to indicate the encoding and endianness of the file. Its byte-swapped form, U+FFFE, is not a valid character, so if it shows up at the beginning of a file, the reader can assume the file uses the opposite endianness. Big endian is assumed if no BOM is present and the format is not otherwise specified.
BOM (bytes in file order) | Encoding |
---|---|
0xFEFF | UCS-2 Big Endian |
0xFFFE | UCS-2 Little Endian |
0x0000FEFF | UCS-4 Big Endian |
0xFFFE0000 | UCS-4 Little Endian |
The BOM allows Unicode text to identify its own characteristics so that no external information is required to display the data correctly.
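As a rough illustration, the BOM table can be turned into a small detection routine. This is a minimal sketch, not a complete decoder: the function name `detect_bom` is hypothetical, and the four-byte marks are checked first because the UCS-4 little endian BOM begins with the same two bytes as the UCS-2 little endian one.

```python
def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    # Longer marks first: FF FE 00 00 also starts with FF FE.
    boms = [
        (b"\x00\x00\xfe\xff", "UCS-4 Big Endian"),
        (b"\xff\xfe\x00\x00", "UCS-4 Little Endian"),
        (b"\xfe\xff", "UCS-2 Big Endian"),
        (b"\xff\xfe", "UCS-2 Little Endian"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0  # caller falls back to big endian or external information

# "Hi" saved as UCS-2 little endian with a BOM in front.
sample = b"\xff\xfeH\x00i\x00"
name, skip = detect_bom(sample)
print(name)  # UCS-2 Little Endian

# Assumption: Python has no "ucs-2" codec, so "utf-16-le" stands in for it;
# the two agree for characters in the Basic Multilingual Plane.
print(sample[skip:].decode("utf-16-le"))  # Hi
```

Note that a UCS-2 little endian file whose first character after the BOM is U+0000 would be mistaken for UCS-4 by this check, so the result is best treated as a strong hint rather than a guarantee.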