2015-10-18 - Character Encoding: Conclusion
Character Encoding
- Part 1 - ASCII
- Part 2 - Extended ASCII
- Part 3 - UCS-2/UCS-4
- Part 4 - Endianness
- Part 5 - UTF-8/UTF-16/UTF-32
- Part 6 - Conclusion
I actually don’t spend a lot of time dealing with character encodings. It comes up a little bit when dealing with files but even then it’s just a matter of selecting the right encoding. So then why did I spend the time to investigate and write about character encodings?
Partially it’s because it’s a good thing to be aware of. It’s useful to understand what selecting the encoding is doing. Also the problem of how to store non-numeric data comes up a lot in programming and character encodings serve as a good example of that problem.
That being said the main thing is how it shows that the best solution to a problem can change over time. The designers of ASCII made the decision that data savings was more important that number of characters so they limited themselves to 7 bits. As time went on this situation changed; space became cheaper while the desire for more characters grew. This lead to Extended-ASCII and Unicode with their 8 bit, 16 bit, and 32 bit characters. Then things partially flipped around again. With the internet and data being sent all over the world space savings became important again but people didn’t want to lose their extra characters. This lead to the UTF formats that went for space savings under common circumstances at the cost of added complexity.
ASCII and Extended ASCII don’t fit current needs because the limited character set and need for code pages complicates sharing information around the world. Similarly UTF-8 wouldn’t have worked for early computers and Teletype machines because the variable width characters would have made it excessively complicated to implement on the hardware of the time.
I find problems that don’t have singular answers to be the most interesting. Problems with the specifics of the situation impact the requirements and multiple solutions can work simultaneously in different situations. All of these encodings are in use today. Some more than others but the introduction of newer encodings hasn’t destroyed those that came before it. One of the goals of UTF-8 was to be compatible with ASCII for this reason.
I’d also like to point out that this is not an exhaustive list. There are a bunch of other character encodings. Even within Unicode their are other transformations and variations of transformations. These are just the most common ones that I know of and have encountered.
Comments: