2014-11-23 - Character Encoding: Extended ASCII
Character Encoding
- Part 1 - ASCII
- Part 2 - Extended ASCII
- Part 3 - UCS-2/UCS-4
- Part 4 - Endianness
- Part 5 - UTF-8/UTF-16/UTF-32
- Part 6 - Conclusion
In the design of ASCII, the decision was made to limit the number of available characters in favour of requiring fewer bits per character. This worked well for the English-speaking United States, which didn't require additional characters, but it caused problems for regions whose languages used accented characters or non-Latin alphabets. As computer usage spread around the world, the need for additional characters rose. At the same time, computers standardized on the 8-bit byte, undoing the data savings a 7-bit encoding provided. This gave rise to multiple encoding schemes that take the base ASCII encoding and use additional bits to encode more characters. These encoding schemes are collectively referred to as Extended ASCII.
There is no single Extended ASCII encoding; instead there are several competing and mostly incompatible schemes. Even within a standard there were usually multiple schemes for use with different languages. Computer manufacturers were among the first to develop extended ASCII encodings. The original encodings used ASCII as a base and extended it to 8 bits, using the additional characters for a variety of purposes depending on the region the computer was intended for. These encodings were implemented in hardware and represented the characters the computer was able to display. As computers advanced, the encoding moved into software, which meant users could switch encodings instead of being limited to the set that came with the computer. The term "Character Set" or "Code Page" was used to refer to a specific encoding among the many that the system supported. For example, code page 437 on DOS systems was the same encoding built into the original IBM PC and was the code page primarily used in the United States.
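To make the idea of a code page concrete, here is a minimal Python sketch that decodes the same bytes under three DOS code pages. It assumes Python's built-in `cp437`, `cp852` and `cp866` codecs reflect the historical tables; the choice of those three code pages, the byte `0xE9` and all variable names are purely illustrative.

```python
# Illustrative only: three DOS code pages as implemented by Python's codecs.
CODE_PAGES = ("cp437", "cp852", "cp866")

# Bytes 0x20-0x7E are printable ASCII; every code page should agree on them.
ascii_part = bytes(range(0x20, 0x7F))
ascii_texts = {cp: ascii_part.decode(cp) for cp in CODE_PAGES}

# A byte with the high bit set belongs to the "extended" half, where the
# code pages are free to disagree.
high_byte = bytes([0xE9])
high_texts = {cp: high_byte.decode(cp) for cp in CODE_PAGES}

for cp in CODE_PAGES:
    same = ascii_texts[cp] == ascii_part.decode("ascii")
    print(f"{cp}: ASCII range unchanged = {same}, byte 0xE9 -> {high_texts[cp]!r}")
```

The lower half of each table is plain ASCII, so English text survives a switch of code page, while anything in the upper half depends on which code page is active.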
ISO/IEC 8859 was developed to try to standardize these extended ASCII encodings. Its parts are numbered 1 through 16 (with 8859-12 abandoned), each intended for a specific set of languages. Some of these parts add accented Latin characters, while others add non-Latin characters such as Greek, Hebrew or Cyrillic. Later, when Windows was being developed, Microsoft created code pages that partially implemented the original IBM code pages and partially implemented the ISO standard parts. These code pages match the standard to varying degrees depending on the language. Other computer manufacturers and software providers developed their own encoding schemes.
Language | ISO/IEC 8859 | DOS Code Page | Windows Code Page |
---|---|---|---|
English | 8859-1 | CP 437 | Windows-1252 |
Polish | 8859-2 | CP 852 | Windows-1250 |
Cyrillic | 8859-5 | CP 855 | Windows-1251 |
Arabic | 8859-6 | CP 720 | Windows-1256 |
Greek | 8859-7 | CP 737 | Windows-1253 |
Hebrew | 8859-8 | CP 862 | Windows-1255 |
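
As a rough illustration of how the columns in a single row relate, the sketch below encodes one accented Polish letter with the three encodings from the Polish row. The codec names are Python's spellings (`iso8859_2`, `cp852`, `cp1250`), and the sample letter and variable names are assumptions chosen for the example.

```python
# Codec names are Python's spellings of the encodings in the Polish row.
POLISH_ROW = {
    "ISO/IEC 8859-2": "iso8859_2",
    "CP 852": "cp852",
    "Windows-1250": "cp1250",
}

text = "ą"  # a single accented Polish letter

encoded = {label: text.encode(codec) for label, codec in POLISH_ROW.items()}
for label, raw in encoded.items():
    # Round-tripping through the same codec always recovers the letter.
    assert raw.decode(POLISH_ROW[label]) == text
    print(f"{label}: {raw.hex()}")
```

Each scheme stores the letter in a single byte, but the byte value is not guaranteed to match across the columns, which is why a file has to be read with the same encoding it was written in.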
In addition to the 8-bit encodings, there were also several Double-Byte / Multi-Byte character sets (DBCS/MBCS) created for Japanese, Chinese and Korean, languages that required a larger number of unique characters. Depending on the encoding in question, a character could be 1 byte long, 2 bytes long or more. Again, an attempt was made to standardize these encodings with the development of ISO/IEC 2022.
Language | ISO/IEC 2022 | DOS/Windows Code Page |
---|---|---|
Japanese | ISO-2022-JP | CP 932 |
Korean | ISO-2022-KR | CP 949 |
Chinese | ISO-2022-CN | CP 936/CP 950 |
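
The variable character length can be seen with a short Python sketch. It assumes Python's `cp932`, `cp949`, `cp936` and `cp950` codecs stand in for the DOS/Windows code pages in the table and `iso2022_jp` for ISO-2022-JP; the sample characters are arbitrary picks for each language.

```python
# One sample character per code page; the choices are illustrative.
SAMPLES = {
    "cp932": "日",  # Japanese
    "cp949": "한",  # Korean
    "cp936": "中",  # Simplified Chinese
    "cp950": "中",  # Traditional Chinese
}

lengths = {}
for codec, char in SAMPLES.items():
    ascii_len = len("A".encode(codec))
    wide_len = len(char.encode(codec))
    lengths[codec] = (ascii_len, wide_len)
    print(f"{codec}: 'A' -> {ascii_len} byte(s), {char!r} -> {wide_len} byte(s)")

# ISO-2022-JP is stateful: escape sequences (starting with the ESC byte,
# 0x1B) switch between character sets inside the byte stream.
iso_jp = "A日".encode("iso2022_jp")
print(iso_jp)
```

In these code pages an ASCII letter still occupies one byte while the East Asian character takes two, so the number of bytes in a string no longer equals the number of characters.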
Although the extended ASCII encodings allowed computers to be used internationally in a standardized way, they also made it very difficult to transfer information between computers. Text saved in a Cyrillic encoding rarely made sense when displayed in a Greek one. There could even be problems when the region was the same but the system was different. This led several people to try to create a unified character encoding that supported all languages in a single scheme.
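The transfer problem is easy to reproduce. The following Python sketch, using an arbitrary Russian sample word and Python's `iso8859_5`, `iso8859_7` and `cp1251` codecs, decodes bytes with the wrong encoding in both situations described above. `errors="replace"` is used because some byte values may be unassigned in the encoding doing the decoding.

```python
text = "Привет"  # "Hello" in Russian; an arbitrary sample word

# Save the text with a Cyrillic encoding (ISO/IEC 8859-5).
raw = text.encode("iso8859_5")

# Different region: read the same bytes as Greek (ISO/IEC 8859-7).
wrong_region = raw.decode("iso8859_7", errors="replace")

# Same region, different system: written as Windows-1251, read as 8859-5.
wrong_system = text.encode("cp1251").decode("iso8859_5", errors="replace")

print("original:     ", text)
print("read as Greek:", wrong_region)
print("1251 as 8859: ", wrong_system)

# Only the encoding that wrote the bytes reads them back correctly.
print("read as 8859-5:", raw.decode("iso8859_5"))
```

Nothing in the bytes themselves records which encoding produced them, so the reader has to know or guess the right one.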