2015-06-20 - Character Encoding: UTF-8/UTF-16/UTF-32

Character Encoding

As Unicode expanded, there was a countermovement to limit the amount of data required per character. This resulted in several Unicode Transformation Formats (UTF), which transform fixed width Unicode code points into a more complex variable width format where only the least commonly used characters require the full 4 bytes.

UTF-8 encodes characters as a series of 8 bit blocks. It was developed for compatibility with ASCII. The first 128 code points (U+0000 to U+007F) are directly encoded as a single byte whose most significant bit is 0. Because these 128 code points match the original 7-bit ASCII encoding, all ASCII text is automatically valid UTF-8 text. Code points above U+007F are encoded as a series of blocks, with the most significant bits of each byte used to encode the sequencing. The first block will have two or more 1s followed by a 0, with the number of 1s indicating the number of bytes in the sequence. Subsequent blocks will have 10 as their most significant bits. The bits of the character are encoded in the remaining bits.

Character	First Block	Second Block	Third Block	Fourth Block
A U+0041	0x41	NA	NA	NA
Σ U+03A3	0xCE	0xA3	NA	NA
😊 U+1F60A	0xF0	0x9F	0x98	0x8A

The bytes for U+1F60A are calculated by first determining the number of bits required to represent the character. 0x1F60A is 0b11111011000001010, which has 17 bits. A 3 byte sequence provides only 16 character bits, so 4 bytes are required. The value is padded with 0s to 21 bits (0b000011111011000001010) and then slotted into the pattern 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx, giving 0b11110000 0b10011111 0b10011000 0b10001010, or 0xF0 0x9F 0x98 0x8A.
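The bit slotting described above can be sketched in Python. This is a minimal illustration rather than a production encoder; the function name utf8_encode is made up for this example, and in practice the built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encodable in UTF-8")

    # Single byte: identical to 7-bit ASCII.
    if code_point < 0x80:
        return bytes([code_point])

    # Pick the sequence length and the lead-byte prefix (110, 1110 or 11110).
    if code_point < 0x800:
        length, lead = 2, 0b11000000
    elif code_point < 0x10000:
        length, lead = 3, 0b11100000
    else:
        length, lead = 4, 0b11110000

    # Peel off 6 bits at a time for the continuation bytes (prefix 10),
    # working from the least significant end.
    blocks = []
    for _ in range(length - 1):
        blocks.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # Whatever bits remain go into the lead byte.
    blocks.append(lead | code_point)
    return bytes(reversed(blocks))


print(utf8_encode(0x1F60A).hex(" "))  # f0 9f 98 8a
```

Comparing the result against "😊".encode("utf-8") is a quick way to check the sketch.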

UTF-16 represents Unicode characters as one or two 16 bit blocks. It was developed for compatibility with existing UCS-2 implementations. All UCS-2 characters are valid UTF-16 characters and require only 2 bytes. Additional characters are encoded using surrogate pairs. If a 16 bit block has a value in the range 0xD800 to 0xDBFF, it is a leading or high surrogate and should be followed by a trailing or low surrogate in the range 0xDC00 to 0xDFFF. The character value is determined by subtracting the base surrogate value from each surrogate, 0xD800 and 0xDC00 respectively, then combining the resulting values as two 10 bit chunks and adding 0x010000.

Character	First Block	Second Block
A U+0041	0x0041	NA
Σ U+03A3	0x03A3	NA
😊 U+1F60A	0xD83D	0xDE0A

The surrogates for U+1F60A are determined by first subtracting 0x010000 from the value to get 0xF60A which, extended to 20 bits, is 0b00001111011000001010. Adding 0xDC00 to the least significant 10 bits (0x20A) gives 0xDE0A, which is the low surrogate. Adding 0xD800 to the most significant 10 bits (0x03D) gives 0xD83D, which is the high surrogate.
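The surrogate arithmetic works in both directions, and can be sketched as follows. The function names to_surrogates and from_surrogates are invented for this example and are not part of any library.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # at most 20 bits
    high = 0xD800 + (offset >> 10)       # most significant 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # least significant 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000


high, low = to_surrogates(0x1F60A)
print(hex(high), hex(low))               # 0xd83d 0xde0a
print(hex(from_surrogates(high, low)))   # 0x1f60a
```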

UTF-32 represents each code point as a single 32 bit block, which is enough to directly represent all current Unicode characters. UTF-32 is identical to UCS-4 but named using the transform pattern to match the other UTF encoding schemes.
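Because no transformation is involved, a UTF-32 block is just the code point written out as a 4 byte integer. A small Python check, assuming big-endian byte order (the helper name utf32_be is made up for this example):

```python
def utf32_be(code_point: int) -> bytes:
    """Write a code point as one big-endian 32 bit block."""
    return code_point.to_bytes(4, "big")


for ch in "AΣ😊":
    block = utf32_be(ord(ch))
    # The built-in codec produces exactly the same bytes.
    assert block == ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X}", block.hex())
```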

UTF-8 and UTF-16 are usually more space efficient than UTF-32, since the most commonly used characters require only 1 or 2 bytes. They are also never less efficient, as a character uses at most 4 bytes. These space savings come at the cost of complexity. With variable width characters it’s no longer possible to find the number of characters in a string or the Nth character without reading through the string. Since computers have become more powerful and the transmission of data more common, this trade-off is acceptable, and it does not limit the number of characters that can be represented by a single encoding.
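Both sides of the trade-off can be seen with the three example characters used above. The sketch below compares encoded sizes and then counts characters in UTF-8 data, which requires scanning every byte; count_code_points is a hypothetical helper written for this example and assumes its input is valid UTF-8.

```python
text = "AΣ😊"

# Encoded size in bytes. The "-le" codec variants are used so that
# no byte order mark is added to the output.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 7, 'utf-16-le': 8, 'utf-32-le': 12}


def count_code_points(data: bytes) -> int:
    """Count characters in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0b11000000) != 0b10000000)


encoded = text.encode("utf-8")
print(len(encoded), count_code_points(encoded))  # 7 3
```

With UTF-32 the character count is simply the byte length divided by 4, and the Nth character starts at byte 4 * N.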
