
2015-04-12 - Character Encoding: UCS-2/UCS-4

Extended ASCII encodings allow a large number of characters to be displayed but require the use of multiple character sets within a single encoding. This means that a single value can map to multiple characters, which causes problems when transmitting data or when a single document needs characters from several sets. To solve this problem a universal character set called Unicode was created. The 2-byte Universal Character Set (UCS-2) uses two bytes to encode every character, which allows for 65,536 possible values. The first 256 characters match ISO/IEC 8859-1 and are similar to the English Windows-1252 code page; characters from a wide variety of other languages and symbols make up the rest. There are no separate character sets, so every value corresponds to only one character. Unicode character codes use the format U+XXXX where U+ indicates that it’s a Unicode character and XXXX is the 4-digit hex value of the character.

Language                   Range
English                    U+0000 - U+00FF
Cyrillic                   U+0400 - U+052F
Arabic                     U+0600 - U+077F
Greek                      U+0370 - U+03FF
Hebrew                     U+0590 - U+05FF
Chinese/Japanese/Korean    U+4E00 - U+9FFF
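
As a rough illustration of the U+XXXX notation and the fixed two-byte layout, here is a minimal Python sketch. The helper names code_point_label and ucs2_bytes are made up for this example, and big-endian byte order is an assumption, since the text above does not specify one.

```python
def code_point_label(ch):
    """Format a single character as U+XXXX (at least four hex digits)."""
    return "U+{:04X}".format(ord(ch))


def ucs2_bytes(ch):
    """Encode one character as two bytes, the way UCS-2 would.

    Big-endian byte order is assumed here for illustration.
    """
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError(code_point_label(ch) + " does not fit in two bytes")
    return cp.to_bytes(2, "big")


# One sample character from several of the ranges in the table,
# written as escapes: Latin A, Cyrillic Zhe, Hebrew Alef, a CJK ideograph.
for ch in ["\u0041", "\u0416", "\u05d0", "\u4e2d"]:
    print(code_point_label(ch), ucs2_bytes(ch).hex())
```

Every character takes exactly two bytes, including the plain English letter, which is the trade-off discussed further down.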

As more and more characters were identified and added to the standard it became clear that 2 bytes were not enough. This led to a 4-byte Universal Character Set (UCS-4) to allow for even more characters. Characters in the range U+00000000 - U+0000FFFF are identical to UCS-2 and make up the Basic Multilingual Plane. Characters above U+0000FFFF make up the supplementary planes.

Plane                                  Range
Basic Multilingual Plane               U+0000 - U+FFFF
Supplementary Multilingual Plane       U+10000 - U+1FFFF
Supplementary Ideographic Plane        U+20000 - U+2FFFF
Supplementary Special-purpose Plane    U+E0000 - U+EFFFF
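
Since each plane in the table covers 0x10000 values, the plane of a character can be worked out by dropping the low 16 bits of its code. The following Python sketch shows the idea; plane_of and plane_name are hypothetical helpers, and only the four planes listed above are named.

```python
# Plane names taken from the table above; other planes are left unnamed.
PLANE_NAMES = {
    0x0: "Basic Multilingual Plane",
    0x1: "Supplementary Multilingual Plane",
    0x2: "Supplementary Ideographic Plane",
    0xE: "Supplementary Special-purpose Plane",
}


def plane_of(code_point):
    """Return the plane number; each plane holds 0x10000 code points."""
    return code_point >> 16


def plane_name(code_point):
    """Return the plane's name, or a generic label for unlisted planes."""
    plane = plane_of(code_point)
    return PLANE_NAMES.get(plane, "plane {}".format(plane))


print(plane_name(0x0041))   # a basic Latin letter
print(plane_name(0x1F600))  # a code point above U+FFFF
print(plane_name(0x20000))  # first value of the ideographic plane
```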

These multi-byte character encodings allow a vast number of characters to be encoded, but the majority of these characters are not commonly used. This makes UCS-2 and especially UCS-4 space inefficient: text that fits in one byte per character in ASCII takes two bytes per character in UCS-2 and four in UCS-4. There are also compatibility issues with earlier encoding schemes if they are incorrectly read as UCS-2 or UCS-4.
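
Both problems can be seen in a few lines of Python. This is only a sketch: Python has no codec named UCS-2 or UCS-4, so it uses utf-16-be and utf-32-be as stand-ins, which produce the same bytes for the plain English sample used here. The big-endian byte order is again an assumption.

```python
text = "Hello, world"

as_ascii = text.encode("ascii")      # one byte per character
as_ucs2 = text.encode("utf-16-be")   # stand-in for UCS-2, two bytes each
as_ucs4 = text.encode("utf-32-be")   # stand-in for UCS-4, four bytes each

print(len(as_ascii), len(as_ucs2), len(as_ucs4))

# Reading single-byte text as if it were two-byte text pairs the bytes up,
# so four ASCII characters turn into two unrelated characters.
misread = b"Hi!!".decode("utf-16-be")
print(["U+{:04X}".format(ord(ch)) for ch in misread])
```

The same twelve characters take 12, 24 and 48 bytes, and the four bytes of "Hi!!" come back as just two characters from entirely different parts of the character set.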

2015-04-03 - Two Questions

I tend to get a lot of ideas for projects. The problem is figuring out before you start whether or not you are going to be motivated to work on and finish a project. After all you don’t get much from not working on a project. I ask myself two questions when considering new project ideas and focus on those that have decent answers to both. The first question is what can I learn from it? The second question is what problem does it solve?

Learning is a big part of why I do things. I want to know more than I currently know and grow my skills and understanding of the universe. So if a project can help me learn something then that provides a big incentive to keep at it. The more I work on the project the more practice I get and the more I learn. Understanding what I can learn from a project will give an indicator of my motivation level through the beginning of the project.

Learning only gets you so far though. There's an upper limit on how much you can learn from a single project. What becomes important later on is what the project will ultimately do. Finishing a project will allow me to use it. Understanding what problem the project solves will give an indicator of my motivation through the end of the project.

Usually I start with a topic that I want to learn about and then I think about a project that will solve a useful problem using that topic. That way I can maintain my motivation to work on the project beyond the learning phase and hopefully end up with something that I can show off: a finished project that shows what I know and makes doing something easier.

2015-03-01 - Time Travel

I enjoy time travel stories. They often play with causality in ways that create situations not possible in any other kind of story. That being said, thinking too deeply about time travel often ends up leading to problems. Then the entire concept just starts falling apart. Take Back to the Future for example.

Back to the Future is a fairly simple time travel story. Marty McFly goes back in time. He prevents his parents from meeting. He then has to get his parents to hook up so he can be born. He returns to the present in time to see himself go into the past. The problem is that the Marty going back in time at the end of the movie is not the same one that went back at the start.

When Marty returns home at the end of the movie it's shown that his family's life has been substantially changed by his actions in the past. They're better off, happier, he has a nice truck, and Biff is some kind of weird manservant. Now if his family has changed then logically he has too. Which means that the Marty going back in time at the end of the movie is likely to have a different reaction to things, and that's a problem because it means he won't change the past the same way the other Marty did.

What happens when Marty2 changes the past again? Is the entire universe in flux as different Martys keep going back in time and changing things? Are they creating an endless stream of universes? Maybe the Martys are piling up in the past and the world is going to be consumed by them. Really this is the trouble with time travel. You end up having to deal with multiple realities, multiple universes, multiple instances of the same event. Once you start doing that it becomes really difficult to have something that makes complete sense.

Personally I don't think time travel is possible because it's likely that it would destroy any universe in which it is possible.

2014-11-23 - Character Encoding: Extended ASCII

In the design of ASCII the decision was made to limit the number of available characters in favour of requiring fewer bits per character. This worked well for the English-speaking United States, which didn't require additional characters, but caused problems for other regions that used languages with accented characters or languages that had non-Latin alphabets. As computer usage spread around the world the need for additional characters rose. At the same time computers standardized on the 8-bit byte, undoing the data savings a 7-bit encoding provided. This gave rise to multiple encoding schemes that take the base ASCII encoding and use the additional bit to encode more characters. These encoding schemes are collectively referred to as Extended ASCII.

There is no single Extended ASCII encoding; instead there are several competing and mostly incompatible schemes. Even within a standard there were usually multiple schemes for use with different languages. Computer manufacturers were among the first to develop extended ASCII encodings. The original encodings used ASCII as a base and extended it to 8 bits, using the additional characters for a variety of uses depending on the region the computer was intended for. These encodings were implemented in hardware and represented the characters the computer was able to display. As computers advanced the encoding was moved into software, which meant users could now switch encodings instead of being limited to the set which came with the computer. The term "Character Set" or "Code Page" was used to refer to a specific encoding among the many that the system supported. For example code page 437 on DOS systems was the same encoding built into the original IBM PC and was the code page primarily used in the United States.

ISO/IEC 8859 was developed to try to standardize these extended ASCII encodings. It contained 16 parts (with 8859-12 being abandoned), each intended to be used for a specific set of languages. Some of these parts add accented Latin characters while others add non-Latin characters such as Greek, Hebrew or Cyrillic. Later, when Windows was being developed, Microsoft created code pages that partially implemented the original IBM code pages and partially implemented the ISO standard parts. These code pages matched the standard to varying degrees depending on language. Other computer manufacturers and software providers developed their own encoding schemes.

Language    ISO/IEC 8859    DOS Code Page    Windows Code Page
English     8859-1          CP 437           Windows-1252
Polish      8859-2          CP 852           Windows-1250
Cyrillic    8859-5          CP 855           Windows-1251
Arabic      8859-6          CP 720           Windows-1256
Greek       8859-7          CP 737           Windows-1253
Hebrew      8859-8          CP 862           Windows-1255
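
To make the incompatibility concrete, the Python sketch below decodes the same byte value under several of the code pages from the table. The helper decode_byte is made up for this example, the byte 0xE0 is an arbitrary choice, and the codec names are the ones Python's standard library uses for these code pages.

```python
import unicodedata

# Python codec names for a few of the code pages listed in the table.
CODE_PAGES = ["cp437", "cp1252", "cp1251", "cp1253", "cp1255"]


def decode_byte(value, code_page):
    """Decode one byte with the given code page and describe the result."""
    ch = bytes([value]).decode(code_page)
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "?"))


# The same byte means a different character in every code page.
for code_page in CODE_PAGES:
    print(code_page, decode_byte(0xE0, code_page))
```

Only the lower half (values 0 to 127) is shared ASCII; anything above that depends entirely on which code page the reader assumes.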

In addition to the 8-bit encodings there were also several Double-Byte / Multi-Byte character sets (DBCS/MBCS) which were created for Japanese, Chinese, and Korean, languages that required a larger number of unique characters. Depending on the encoding in question characters could be 1 byte long, 2 bytes long or more. Again an attempt was made to standardize these encodings with the development of ISO/IEC 2022.

Language    ISO/IEC 2022    DOS/Windows Code Page
Japanese    ISO-2022-JP     CP 932
Korean      ISO-2022-KR     CP 949
Chinese     ISO-2022-CN     CP 936/CP 950
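
The variable character length is easy to check with Python's built-in cp932 and cp949 codecs. This is a small sketch; byte_lengths is a hypothetical helper, and the sample characters are written as escapes.

```python
def byte_lengths(text, code_page):
    """Return (code point label, encoded byte count) for each character."""
    return [
        ("U+{:04X}".format(ord(ch)), len(ch.encode(code_page)))
        for ch in text
    ]


# "A" followed by the kanji U+65E5, using the Japanese code page CP 932.
print(byte_lengths("A\u65e5", "cp932"))

# "A" followed by the hangul syllable U+D55C, using the Korean CP 949.
print(byte_lengths("A\ud55c", "cp949"))
```

The ASCII letter stays one byte long while the kanji and the hangul syllable each take two, so a program cannot find the Nth character without scanning from the start of the text.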

Although the extended ASCII encodings allowed for standardized international use of computers, they also made it very difficult to transfer information between computers. Text saved in a Cyrillic encoding rarely made sense when displayed in Greek. There could even be problems if the region was the same but the system was different. This led several people to try to create a unified character encoding that supported all languages in a single scheme.
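
The Cyrillic-displayed-as-Greek problem can be reproduced in Python. In this sketch, misread and labels are made-up helper names, and Windows-1251 and Windows-1253 are picked as the Cyrillic and Greek encodings purely for illustration.

```python
def misread(text, written_as, read_as):
    """Encode text with one code page, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


def labels(text):
    """List the characters of text in U+XXXX form."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]


# The Russian word for "hello", written with escapes.
original = "\u043f\u0440\u0438\u0432\u0435\u0442"

# Saved on a Cyrillic system, opened on a Greek one.
garbled = misread(original, "cp1251", "cp1253")

print(labels(original))
print(labels(garbled))
```

The bytes on disk never change; only the interpretation does, which is why every character of the Russian word comes out as a Greek letter.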