2015-07-04 - Hello
I’ve recently finished rewriting the code viewer to be a command line application. The main reason for doing this was to make it easier to debug and improve. While making the command line version I changed some things so now I have to port those changes back to the web. Then I can test issues in the command line version and improve both in unison.
A side benefit of the command line version is that I can easily generate formatted code snippets. That means if I wanted to show how to write “Hello” to the console in C# I could just ask the program to format that snippet and then include it in the page like this.
Or if I wanted to do it in C.
Or Java.
Or even MS-DOS x86 Assembly.
I’ve started a page to collect some small “Hello” programs. The plan is to add to it as I learn new languages.
2015-06-20 - Character Encoding: UTF-8/UTF-16/UTF-32
As Unicode expanded there was a counter movement to limit the amount of data required per character. This resulted in several Unicode Transformation Formats (UTF) that aimed to transform the fixed width Unicode characters into a more complex format where only the least commonly used characters required the full 4 bytes.
UTF-8 encodes characters as a series of 8 bit blocks. It was developed for compatibility with ASCII. The first 128 characters (U+0000 to U+007F) are directly encoded as a single byte. Because the first 128 Unicode characters match the original 7-bit ASCII encoding, all ASCII text is automatically valid UTF-8 text. Characters above 127 are encoded as a series of blocks with the most significant bits of each byte used to encode the sequencing. The first block will have two or more 1s followed by a 0, with the number of 1s indicating the number of bytes in the sequence. Subsequent blocks will have 10 as their most significant bits. The bits of the character are encoded in the remaining bits.
Character | First Block | Second Block | Third Block | Fourth Block |
---|---|---|---|---|
A U+0041 | 0x41 | NA | NA | NA |
Σ U+03A3 | 0xCE | 0xA3 | NA | NA |
😊 U+1F60A | 0xF0 | 0x9F | 0x98 | 0x8A |
The bytes for U+1F60A are calculated by first determining the number of bits required to represent the character. 0x1F60A is 0b11111011000001010, which has 17 bits. 3 bytes provide only 16 character bits, so 4 bytes are required. The value is padded with 0s to 21 bits and then slotted into the pattern 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx.
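The bit slotting described above can be sketched in a few lines of Python. This is only an illustration: the function name `utf8_encode` is made up for this example, and it does not reject the surrogate range the way a production encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by hand (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x80:
        # Fits in 7 bits: a single byte, identical to ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        lead, continuation_count = 0b11000000, 1   # 110xxxxx: 2 byte sequence
    elif code_point < 0x10000:
        lead, continuation_count = 0b11100000, 2   # 1110xxxx: 3 byte sequence
    else:
        lead, continuation_count = 0b11110000, 3   # 11110xxx: 4 byte sequence

    blocks = []
    for _ in range(continuation_count):
        # Each continuation block is 10xxxxxx and carries the next 6 low bits.
        blocks.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Whatever bits remain go into the free bits of the first block.
    blocks.append(lead | code_point)
    return bytes(reversed(blocks))


print(utf8_encode(0x1F60A).hex())
```

Comparing the result against Python's built-in `chr(0x1F60A).encode("utf-8")` is a quick way to check the sketch.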
UTF-16 represents Unicode characters as one or two 16 bit blocks. It was developed for compatibility with existing UCS-2 implementations. All UCS-2 characters are valid UTF-16 characters and require only 2 bytes. Additional characters are encoded using surrogate pairs. If a 16 bit block has a value in the range 0xD800 to 0xDBFF it is a leading or high surrogate and should be followed by a trailing or low surrogate in the range 0xDC00 to 0xDFFF. The character value is determined by subtracting the base surrogate value from each half of the pair, 0xD800 and 0xDC00 respectively, then combining the resulting values as two 10 bit chunks and adding 0x010000.
Character | First Block | Second Block |
---|---|---|
A U+0041 | 0x0041 | NA |
Σ U+03A3 | 0x03A3 | NA |
😊 U+1F60A | 0xD83D | 0xDE0A |
The surrogates for U+1F60A are determined by first subtracting 0x010000 from the value to get 0xF60A which, extended to 20 bits, is 0b00001111011000001010. Adding 0xDC00 to the least significant 10 bits gives 0xDE0A, which is the low surrogate. Adding 0xD800 to the most significant 10 bits gives 0xD83D, which is the high surrogate.
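As a rough sketch of the same arithmetic in Python, going in both directions. The names `to_surrogates` and `from_surrogates` are invented for this example and are not part of any library.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # at most 20 bits
    high = 0xD800 + (offset >> 10)         # most significant 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # least significant 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F60A)
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```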
UTF-32 represents every code point as a single 32 bit block, which is enough to directly represent all current Unicode characters. UTF-32 is identical to UCS-4 but named using the transform pattern to match the other UTF encoding schemes.
UTF-8 and UTF-16 are usually more space efficient than UTF-32 since most characters require only 1 or 2 bytes. They are also never less efficient, as a character uses at most 4 bytes. This space saving comes at the cost of complexity. With variable width characters it’s no longer possible to find the number of characters in a string, or the Nth character, without reading through the string. Since computers have become more powerful and the transmission of data more common, this trade-off is acceptable: it saves space without limiting the number of characters that can be represented by a single encoding.
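A small Python sketch makes the trade-off concrete, using the three characters from the tables above. The helper `count_code_points` is a made-up name for this example, and the little-endian codec variants are used only so that Python does not prepend a byte order mark to the sizes.

```python
text = "A\u03a3\U0001F60A"  # A, Σ and 😊

# Size in bytes of the same three characters in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)


def count_code_points(utf8_bytes: bytes) -> int:
    """Count characters in UTF-8 data by walking every byte.

    Continuation blocks start with the bits 10, so every byte that does
    not match that pattern begins a new character.
    """
    return sum(1 for byte in utf8_bytes if (byte & 0b11000000) != 0b10000000)


encoded = text.encode("utf-8")
print(len(encoded), count_code_points(encoded))
```

With UTF-32 the character count is simply the byte length divided by 4. With UTF-8 the whole string has to be scanned, as the helper does.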
2015-06-07 - Domestic Flights
Recently I had the chance to go on a short domestic flight. Most of the flights I’ve been on have been long international flights and by comparison this flight was amazing. A little bit of security at the start and then it’s just get on plane, get off plane, go home. No serious security officers scrutinising you. No customs agent prying at you. Nice, simple, and easy.
Even the flight itself was great. I personally enjoy flying, but there’s only so long that it feels comfortable to sit in a tiny seat. With this flight they had barely finished serving snacks when it was time to land.
Good times.
2015-05-24 - Shakespeare Wasn’t a Programmer
There’s a line in one of William Shakespeare's plays that goes something like “A rose by any other name would smell as sweet”. The idea being that what you call something doesn’t impact what it is. A rose doesn’t change its smell because you call it something else. This is not the case in programming where something’s name determines what it is.
There are lots of named things in a program and the names chosen are used to give meaning to their usage within the program. A variable name describes what data it holds. A function name describes what it does. It’s possible to have identical variables with identical data that are differentiated because of their name. An account object may have two floating point values, one named “Credits” and one named “Debits”. They could be stored using the same pattern and contain the same value but because of their names we know they are different.
Going even further, without a type a variable is just a set of bits. It’s possible to take a floating point value and treat it as an integer or as a series of characters. The type is required to give meaning to the data stored, but that type can be changed and the meaning of the data along with it.
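As a small illustration, here is a Python sketch that reads the same 4 bytes as a float, as an integer, and as characters. It assumes a 32 bit little-endian layout, and the `reinterpret` helper is a name invented for this example.

```python
import struct


def reinterpret(raw: bytes) -> dict:
    """Read the same 4 bytes under three different types (little-endian)."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return {
        "float": struct.unpack("<f", raw)[0],   # IEEE 754 single precision
        "int": struct.unpack("<I", raw)[0],     # unsigned 32 bit integer
        "chars": raw.decode("latin-1"),         # one character per byte
    }


# The bits of the float 1.0, viewed as an integer.
one = reinterpret(struct.pack("<f", 1.0))
print(one["float"], hex(one["int"]))

# The characters "Hell", viewed as an integer and as a float.
hell = reinterpret(b"Hell")
print(hell["chars"], hex(hell["int"]), hell["float"])
```

The bits never change between the three views. Only the name of the type used to read them does.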
That’s the problem with working in a medium where you’re making everything.