
2015-04-12 - Character Encoding: UCS-2/UCS-4

Character Encoding

Extended ASCII encodings allow for a large number of characters to be displayed but require the use of multiple character sets (code pages) within a single encoding. This means that a single value can map to different characters depending on the set in use, which causes problems when transmitting data or when a single document needs characters from several sets. To solve this problem a universal character set called Unicode was created. The 2 byte Universal Character Set (UCS-2) uses two bytes to encode every character, which allows for up to 65,536 distinct values. The first 256 characters match ISO-8859-1 (Latin-1), which is similar to the English Windows-1252 code page, and characters from a wide variety of other languages and symbols make up the rest. There are no separate character sets, so every value corresponds to only one character. Unicode character codes use the format U+XXXX, where U+ indicates that it’s a Unicode character and XXXX is the 4 digit hex value of the character.
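
As a rough illustration of the U+XXXX notation and the fixed two-byte layout, here is a minimal Python sketch. The function names code_point_label and ucs2_encode are made up for this example, and big-endian byte order is an assumption, since the text above does not specify one.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


def ucs2_encode(text: str) -> bytes:
    """Encode text as UCS-2: exactly two bytes per character.

    Assumption: big-endian byte order (high byte first).
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Two bytes can only hold values up to U+FFFF.
            raise ValueError(f"{code_point_label(ch)} does not fit in two bytes")
        out += cp.to_bytes(2, "big")
    return bytes(out)


print(code_point_label("A"))            # U+0041
print(ucs2_encode("A\u00e9").hex())     # 004100e9
```

Every character costs two bytes here, even plain "A", and anything above U+FFFF is rejected outright.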

Language                  Range
English                   U+0000 - U+00FF
Cyrillic                  U+0400 - U+052F
Arabic                    U+0600 - U+077F
Greek                     U+0370 - U+03FF
Hebrew                    U+0590 - U+05FF
Chinese/Japanese/Korean   U+4E00 - U+9FFF
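
Because every value maps to exactly one character, finding the language of a character is a plain range check. The sketch below copies the ranges from the table above; the RANGES list and the language_of function are names invented for this example, and characters outside the listed ranges simply return None.

```python
# Ranges copied from the table above: (label, first, last), inclusive.
RANGES = [
    ("English", 0x0000, 0x00FF),
    ("Cyrillic", 0x0400, 0x052F),
    ("Arabic", 0x0600, 0x077F),
    ("Greek", 0x0370, 0x03FF),
    ("Hebrew", 0x0590, 0x05FF),
    ("Chinese/Japanese/Korean", 0x4E00, 0x9FFF),
]


def language_of(ch: str):
    """Return the table label whose range contains ch, or None."""
    cp = ord(ch)
    for name, first, last in RANGES:
        if first <= cp <= last:
            return name
    return None


print(language_of("A"))        # English
print(language_of("\u0416"))   # Cyrillic (U+0416)
print(language_of("\u4e2d"))   # Chinese/Japanese/Korean (U+4E2D)
```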

As more and more characters were identified and added to the standard it became clear that 2 bytes were not enough. This led to a 4 byte Universal Character Set (UCS-4) to allow for even more characters. Characters in the range U+00000000 - U+0000FFFF are identical to UCS-2 and make up the Basic Multilingual Plane. Characters above U+0000FFFF make up the supplementary planes.

Plane                                 Range
Basic Multilingual Plane              U+0000 - U+FFFF
Supplementary Multilingual Plane      U+10000 - U+1FFFF
Supplementary Ideographic Plane       U+20000 - U+2FFFF
Supplementary Special-purpose Plane   U+E0000 - U+EFFFF
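
Each plane in the table spans 0x10000 values, so the plane number is whatever is left after dropping the low 16 bits of the character value. The following Python sketch shows that, along with a fixed four-byte encoding. The names plane_of, plane_name and ucs4_encode are hypothetical, and big-endian byte order is assumed.

```python
# Plane numbers for the planes listed in the table above.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-purpose Plane",
}


def plane_of(code_point: int) -> int:
    """Plane number: the code point with its low 16 bits dropped."""
    return code_point >> 16


def plane_name(code_point: int) -> str:
    return PLANE_NAMES.get(plane_of(code_point), "not listed in the table")


def ucs4_encode(text: str) -> bytes:
    """Encode text as UCS-4: four bytes per character (big-endian assumed)."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


print(plane_name(0x0041))        # Basic Multilingual Plane
print(plane_name(0x1F600))       # Supplementary Multilingual Plane
print(ucs4_encode("A").hex())    # 00000041
```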

These multi-byte character encodings allow for a vast number of characters to be encoded, but the majority of these characters are not commonly used. This makes UCS-2 and especially UCS-4 space inefficient: text that fits in a single-byte encoding takes two or four times as many bytes. There are also compatibility issues with earlier encoding schemes if they are incorrectly read as UCS-2 or UCS-4.
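
Both problems are easy to see with a small Python sketch. The helpers encode_fixed and ucs2_decode are invented for this example and assume big-endian byte order. The first part compares the size of the same text at one, two and four bytes per character; the second reads two single-byte characters as if they were one UCS-2 value.

```python
def encode_fixed(text: str, width: int) -> bytes:
    """Encode each character as a fixed-width big-endian integer."""
    return b"".join(ord(ch).to_bytes(width, "big") for ch in text)


def ucs2_decode(data: bytes) -> str:
    """Interpret bytes as big-endian UCS-2, two bytes per character."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    )


text = "hello"
print(len(encode_fixed(text, 1)))   # 5  (single-byte encoding)
print(len(encode_fixed(text, 2)))   # 10 (UCS-2)
print(len(encode_fixed(text, 4)))   # 20 (UCS-4)

# The single-byte text "Hi" is the bytes 0x48 0x69. Read as UCS-2 they
# become one unrelated character, U+4869.
misread = ucs2_decode(b"Hi")
print(f"U+{ord(misread):04X}")      # U+4869
```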
