2015-04-19 - Character Encoding: Endianness

Character Encoding

Data is stored on computers as a series of bytes, and the order in which the bytes of a multi-byte value are saved depends on the endianness of the system. Big endian systems store the most significant byte (MSB) first, while little endian systems store the least significant byte (LSB) first. For example, consider the number 305,419,896, which is 0x12345678 in hex. Each pair of hex digits is one byte, so a big endian system would save the byte 0x12 first while a little endian system would save 0x78 first.

Endianness      Low Address        High Address
Big Endian      0x12   0x34   0x56   0x78
Little Endian   0x78   0x56   0x34   0x12
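
As a minimal sketch, the two layouts can be reproduced in Python with the built-in int.to_bytes method. The helper name as_hex and the variable names are only illustrative.

```python
# The example value from the text: 305,419,896 == 0x12345678.
value = 0x12345678

# Serialize the same 32-bit value in both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

def as_hex(data):
    """Format bytes as space-separated hex, lowest address first."""
    return " ".join(f"0x{b:02X}" for b in data)

print("Big Endian   ", as_hex(big))     # 0x12 0x34 0x56 0x78
print("Little Endian", as_hex(little))  # 0x78 0x56 0x34 0x12
```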

If the same system saves and loads the data then everything’s fine. If the value is saved on a little endian system but read on a big endian system, the reader gets the incorrect value 2,018,915,346 (0x78563412). The same happens going from a big endian system to a little endian system.
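
The mismatch can be simulated on a single machine. The sketch below assumes a 4-byte unsigned integer and uses Python's standard struct module, where "<" selects little endian and ">" selects big endian. The variable names are illustrative.

```python
import struct

value = 305419896  # 0x12345678

# Write the value as a 4-byte unsigned integer in little endian order.
stored = struct.pack("<I", value)

# Reading with the same byte order recovers the original value.
same_order = struct.unpack("<I", stored)[0]

# Reading the same bytes as big endian yields a different number.
wrong_order = struct.unpack(">I", stored)[0]

print(same_order)   # 305419896
print(wrong_order)  # 2018915346
```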

Endianness is not specific to character encodings, but it is one of the places where it’s most noticeable because text is commonly sent between computers. For this reason programs are often designed to read and write both byte orders, so endianness is no longer a function of the computer being used but dependent on how the program saves the data.

UCS-2 and UCS-4 solve this problem using a byte order mark (BOM). The character U+FEFF is placed at the start of the file to indicate the encoding and endianness of the file. U+FFFE is not a valid character, so if the first character of a file reads as U+FFFE, the reader can assume the file was written in the opposite byte order and that the bytes need to be swapped. Big endian is assumed if no BOM is present and the format is not otherwise specified.

BOM          Encoding
0xFEFF       UCS-2 Big Endian
0xFFFE       UCS-2 Little Endian
0x0000FEFF   UCS-4 Big Endian
0xFFFE0000   UCS-4 Little Endian
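
A reader could apply the table roughly as follows. This is a sketch rather than a complete detector: the function name detect_bom and the returned labels are made up for illustration. It checks the 4-byte marks first, because the UCS-4 little endian mark begins with the same two bytes as the UCS-2 little endian mark. A UCS-2 little endian file whose first character after the BOM is U+0000 would therefore be misreported as UCS-4.

```python
# BOM byte sequences in file order, longest first so that FF FE 00 00
# is not mistaken for the 2-byte mark FF FE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 Big Endian"),
    (b"\xff\xfe\x00\x00", "UCS-4 Little Endian"),
    (b"\xfe\xff", "UCS-2 Big Endian"),
    (b"\xff\xfe", "UCS-2 Little Endian"),
]

def detect_bom(data):
    """Return the encoding label for a leading BOM, or None if there is none."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

# "hi" saved as UCS-2 little endian with a BOM.
print(detect_bom(b"\xff\xfeh\x00i\x00"))  # UCS-2 Little Endian
```

When detect_bom returns None, the caller would fall back to big endian unless the format is specified some other way.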

The BOM allows Unicode text to identify its own characteristics so that no external information is required to display the data correctly.
