2015-04-12 - Character Encoding: UCS-2/UCS-4
Character Encoding
- Part 1 - ASCII
- Part 2 - Extended ASCII
- Part 3 - UCS-2/UCS-4
- Part 4 - Endianness
- Part 5 - UTF-8/UTF-16/UTF-32
- Part 6 - Conclusion
Extended ASCII encodings allow a large number of characters to be displayed, but only by defining many separate character sets (code pages). The same byte value can therefore map to different characters depending on which set is in use, which causes problems when transmitting data or when a single document needs characters from several sets. To solve this problem, a universal character set called Unicode was created. The 2-byte Universal Character Set (UCS-2) uses two bytes for every character, which allows up to 65,536 distinct values. The first 256 characters are similar to the English Windows-1252 code page, and characters from a wide variety of other languages and symbols make up the rest. There are no separate character sets, so every value corresponds to exactly one character. Unicode character codes use the format U+XXXX, where U+ indicates that it’s a Unicode character and XXXX is the 4-digit hex value of the character.
Language | Range |
---|---|
English | U+0000 - U+00FF |
Cyrillic | U+0400 - U+052F |
Arabic | U+0600 - U+06FF |
Greek | U+0370 - U+03FF |
Hebrew | U+0590 - U+05FF |
Chinese/Japanese/Korean | U+4E00 - U+9FFF |
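As a rough illustration of the U+XXXX notation and the fixed two-byte layout, here is a minimal Python sketch. The helper names `code_point_label` and `ucs2_encode` are made up for this example, and it assumes the most significant byte is written first (byte order is the subject of Part 4).

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def ucs2_encode(text: str) -> bytes:
    """Encode text as UCS-2: exactly two bytes per character.

    Assumption: most significant byte first (big-endian).
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no way to represent values wider than 16 bits.
            raise ValueError(f"{code_point_label(ch)} does not fit in two bytes")
        out += cp.to_bytes(2, "big")
    return bytes(out)


# Latin "A", Cyrillic "zhe" (U+0436), and a CJK ideograph (U+4E2D)
for ch in "A\u0436\u4e2d":
    print(code_point_label(ch), ucs2_encode(ch).hex())
# U+0041 0041
# U+0436 0436
# U+4E2D 4e2d
```

Every character takes two bytes regardless of script, so the byte values are simply the code point written out in hex.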
As more and more characters were identified and added to the standard, it became clear that two bytes were not enough. This led to a 4-byte Universal Character Set (UCS-4), which allows for far more characters. Characters in the range U+00000000 - U+0000FFFF are identical to UCS-2 and make up the Basic Multilingual Plane. Characters above U+0000FFFF make up the supplementary planes.
Plane | Range |
---|---|
Basic Multilingual Plane | U+0000 – U+FFFF |
Supplementary Multilingual Plane | U+10000 – U+1FFFF |
Supplementary Ideographic Plane | U+20000 – U+2FFFF |
Supplementary Special-purpose Plane | U+E0000 – U+EFFFF |
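Because each plane spans 0x10000 code points, the plane number can be read straight off the code point by dropping its low 16 bits. The sketch below shows this along with a four-byte encoding. The names `plane_of` and `ucs4_encode` are hypothetical helpers for this example, and big-endian byte order is again assumed.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number; each plane holds 0x10000 code points."""
    return code_point >> 16


def ucs4_encode(text: str) -> bytes:
    """Encode text as UCS-4: exactly four bytes per character.

    Assumption: most significant byte first (big-endian).
    """
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


for cp in (0x0041, 0x1F600, 0x20000, 0xE0001):
    print(f"U+{cp:04X} plane {plane_of(cp)} {ucs4_encode(chr(cp)).hex()}")
# U+0041 plane 0 00000041
# U+1F600 plane 1 0001f600
# U+20000 plane 2 00020000
# U+E0001 plane 14 000e0001
```

Plane 0 is the Basic Multilingual Plane, so any character that fits in UCS-2 has the same value in UCS-4 with two leading zero bytes.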
These multi-byte character encodings allow a vast number of characters to be encoded, but the majority of those characters are not commonly used. This makes UCS-2, and especially UCS-4, space inefficient: text made up mostly of ASCII characters takes two or four times as many bytes as it would in a single-byte encoding. There are also compatibility issues with earlier encoding schemes if their data is incorrectly read as UCS-2 or UCS-4.
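Both problems can be seen with a few lines of Python. This is only a sketch: `encoded_sizes` and `misread_ascii_as_ucs2` are illustrative names, and big-endian byte order is assumed for the multi-byte forms.

```python
def encoded_sizes(text: str) -> tuple:
    """Return the byte counts of ASCII-only text as (ASCII, UCS-2, UCS-4)."""
    ascii_len = len(text.encode("ascii"))
    ucs2_len = sum(len(ord(ch).to_bytes(2, "big")) for ch in text)
    ucs4_len = sum(len(ord(ch).to_bytes(4, "big")) for ch in text)
    return ascii_len, ucs2_len, ucs4_len


def misread_ascii_as_ucs2(data: bytes) -> list:
    """Interpret single-byte data as if it were two-byte UCS-2 units.

    Assumption: big-endian units; a trailing odd byte is ignored.
    """
    labels = []
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        labels.append(f"U+{unit:04X}")
    return labels


print(encoded_sizes("Hello, world"))   # (12, 24, 48)
print(misread_ascii_as_ucs2(b"Hi"))    # ['U+4869']
```

The two ASCII bytes for "Hi" (0x48, 0x69) are swallowed into a single unrelated code point, which is the kind of garbling that occurs when older single-byte data is read as UCS-2.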