2015-04-19 - Character Encoding: Endianness

Character Encoding

Data is stored on computers as a series of bytes, and the order in which the bytes of a multi-byte value are saved depends on the endianness of the system. Big endian systems store the most significant byte (MSB) first, while little endian systems store the least significant byte (LSB) first. For example, consider the number 305,419,896, which is 0x12345678 in hex. Each pair of hex digits is one byte, so in a big endian system the byte 0x12 would be saved first, while in a little endian system 0x78 would be saved first.

Endianness	Low Address			High Address
Big Endian	0x12	0x34	0x56	0x78
Little Endian	0x78	0x56	0x34	0x12
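
The table can be reproduced with a short Python sketch. It only uses the built-in int.to_bytes method; the variable names are chosen for illustration.

```python
value = 0x12345678  # 305,419,896 in decimal

# Lay the same 32-bit value out in both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12
```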

If the same system saves and loads the data, everything is fine. If the value is saved on a little endian system but read on a big endian system, the reader gets the incorrect value 2,018,915,346 (0x78563412). The same happens going from a big endian system to a little endian system.
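
The mismatch is easy to simulate on a single machine. In this sketch the "little endian writer" and "big endian reader" are just the two byteorder arguments; no second system is involved.

```python
# Bytes as a little endian system would save 0x12345678.
saved = (0x12345678).to_bytes(4, byteorder="little")

# A big endian reader treats the first byte as the most significant one.
misread = int.from_bytes(saved, byteorder="big")

print(misread)       # 2018915346
print(hex(misread))  # 0x78563412
```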

Endianness is not specific to character encodings, but it is one of the places where it’s most noticeable, as text is commonly sent between computers. Because of this, programs are often designed to read and write both byte orders, so endianness is no longer a function of the computer being used but dependent on how the program saves the data.
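
As a rough illustration of a program choosing its own byte order, the sketch below uses Python's struct module with an explicit ">" (big endian) prefix. The helper names write_u32 and read_u32 are hypothetical, and picking big endian as the on-disk order is an assumption made for the example.

```python
import struct
import sys

# The host's native order ("little" or "big"); the helpers below ignore it.
print(sys.byteorder)

def write_u32(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as big endian on any host."""
    return struct.pack(">I", value)

def read_u32(data: bytes) -> int:
    """Decode an unsigned 32-bit big endian integer on any host."""
    return struct.unpack(">I", data)[0]
```

Because both helpers name the byte order explicitly, data written on one machine reads back to the same value on any other.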

UCS-2 and UCS-4 solve this problem using a byte order mark (BOM). The character U+FEFF is placed at the start of the file to indicate the encoding and endianness of the file. U+FFFE is not a valid character (Unicode reserves it as a noncharacter), so if the first character reads as U+FFFE, the reader can assume the bytes are in the opposite order from the one it tried. Big endian is assumed if no BOM is present and the format is not otherwise specified.

BOM (bytes in file order)	Encoding
0xFEFF	UCS-2 Big Endian
0xFFFE	UCS-2 Little Endian
0x0000FEFF	UCS-4 Big Endian
0xFFFE0000	UCS-4 Little Endian
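
A minimal sketch of BOM detection based on the table above. The function name detect_bom and its return shape are made up for this example, and the fallback to UCS-2 big endian when no BOM is found is an assumption: the text only fixes the byte order default, not the encoding.

```python
# (BOM bytes, encoding, byte order). The 4-byte marks are listed first
# because the UCS-4 little endian BOM starts with the UCS-2 little endian one.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4", "big"),
    (b"\xff\xfe\x00\x00", "UCS-4", "little"),
    (b"\xfe\xff", "UCS-2", "big"),
    (b"\xff\xfe", "UCS-2", "little"),
]

def detect_bom(data: bytes, default=("UCS-2", "big")):
    """Return (encoding, byte order, BOM length) for the start of data."""
    for bom, encoding, order in BOMS:
        if data.startswith(bom):
            return encoding, order, len(bom)
    # No BOM: fall back to the assumed default, with nothing to skip.
    return default[0], default[1], 0

print(detect_bom(b"\xff\xfeA\x00"))  # ('UCS-2', 'little', 2)
```

One caveat: the bytes FF FE 00 00 could also be a UCS-2 little endian BOM followed by U+0000, so this check is a heuristic, not a proof.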

The BOM allows Unicode text to identify its own characteristics so that no external information is required to display the data correctly.
