2015-07-25 - The Correct Date Format

Occasionally you will hear the British and the Americans fighting over what is the correct date format. The Americans think it’s MM/DD/YYYY and the British think it’s DD/MM/YYYY, but they are both wrong because the correct format is YYYY-MM-DD.

One reason is ambiguity. Without some additional context it’s hard to tell if 12/01/2012 is December 1st, 2012 or the 12th of January, 2012. This ambiguity doesn’t exist for YYYY-MM-DD because there’s no YYYY-DD-MM in common usage. This means that 2012-12-01 is always December 1st, 2012 and there’s no chance of reading the date wrong.
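As a rough illustration of the ambiguity, here is a small Python sketch (Python is my choice for the example, and the variable names are made up). It parses the same string under both conventions using the standard `datetime.strptime` function.

```python
from datetime import datetime

text = "12/01/2012"

# The same string parses without error under both conventions,
# but the two readings are different dates.
american = datetime.strptime(text, "%m/%d/%Y").date()
british = datetime.strptime(text, "%d/%m/%Y").date()

print(american.isoformat())  # 2012-12-01
print(british.isoformat())   # 2012-01-12

# The YYYY-MM-DD form has only one reading in common usage.
iso = datetime.strptime("2012-12-01", "%Y-%m-%d").date()
print(iso == american)  # True
```

Neither parse fails, which is the problem: nothing in the string itself says which reading was intended.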

The second reason is sorting. When dates are sorted as text they are ordered by the first part of the date first. This means that MM/DD/YYYY sorts by month, DD/MM/YYYY sorts by day, and YYYY-MM-DD sorts by year. Ideally you want dates sorted chronologically which means it’s better to have the largest part first. YYYY-MM-DD does this while the other formats will mix dates up.

Consider the dates April 2nd 2012, April 3rd 2012, June 15th 2012, April 17th 2013, and June 2nd 2013. The following table shows these dates sorted according to the various date formats.

YYYY-MM-DD MM/DD/YYYY DD/MM/YYYY
2012-04-02 04/02/2012 02/04/2012
2012-04-03 04/03/2012 02/06/2013
2012-06-15 04/17/2013 03/04/2012
2013-04-17 06/02/2013 15/06/2012
2013-06-02 06/15/2012 17/04/2013

With YYYY-MM-DD format everything is in order. All the 2012 dates come before the 2013 dates, and all the April dates come before the June dates within the same year. With MM/DD/YYYY we have 2012 dates before and after the 2013 dates. With DD/MM/YYYY we have 2013 dates before 2012 dates and June dates before April dates.
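The table can be reproduced with a short Python sketch. The helper name `sorted_as_text` is made up for this example. It formats the five dates from above and sorts the resulting strings as plain text.

```python
from datetime import date

# The five example dates, deliberately out of order.
dates = [
    date(2012, 6, 15),
    date(2013, 4, 17),
    date(2012, 4, 2),
    date(2013, 6, 2),
    date(2012, 4, 3),
]

formats = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "MM/DD/YYYY": "%m/%d/%Y",
    "DD/MM/YYYY": "%d/%m/%Y",
}


def sorted_as_text(fmt):
    """Format every date with fmt, then sort the strings, not the dates."""
    return sorted(d.strftime(fmt) for d in dates)


for name, fmt in formats.items():
    print(name, sorted_as_text(fmt))

# Only the year-first format matches the true chronological order.
chronological = [d.strftime("%Y-%m-%d") for d in sorted(dates)]
print(sorted_as_text("%Y-%m-%d") == chronological)  # True
```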

Also I personally think the dashes look better.

2015-07-04 - Hello

I’ve recently finished rewriting the code viewer to be a command line application. The main reason for doing this was to make it easier to debug and improve. While making the command line version I changed some things so now I have to port those changes back to the web. Then I can test issues in the command line version and improve both in unison.

A side benefit of the command line version is that I can easily generate formatted code snippets. That means if I wanted to show how to write “Hello” to the console in C# I could just ask the program to format that snippet and then include it in the page like this.

Console.WriteLine("Hello");

Or if I wanted to do it in C.

printf("Hello\n");

Or Java.

System.out.println("Hello");

Or even MS-DOS x86 Assembly.

mov ax, 0200h ; AH = 02h, DOS function: Write character in DL to standard output
mov dl, 48h ; H
int 021h
mov dl, 65h ; e
int 021h
mov dl, 6Ch ; l
int 021h
mov dl, 6Ch ; l
int 021h
mov dl, 6Fh ; o
int 021h
mov dl, 0Dh ; \r
int 021h
mov dl, 0Ah ; \n
int 021h

I’ve started a page to collect some small “Hello” programs. The plan is to add to it as I learn new languages.

2015-06-20 - Character Encoding: UTF-8/UTF-16/UTF-32

As Unicode expanded there was a counter movement to limit the amount of data required per character. This resulted in several Unicode Transformation Formats (UTF) that aimed to transform the fixed-width Unicode characters into a more complex, variable-width format where only the least commonly used characters required the full 4 bytes.

UTF-8 encodes characters as a series of 8-bit blocks. It was developed for compatibility with ASCII. The first 128 characters (U+0000 to U+007F) are directly encoded as a single byte. Because these 128 Unicode characters match the original 7-bit ASCII encoding, all ASCII text is automatically valid UTF-8 text. Characters above 127 are encoded as a series of blocks with the most significant bits of each byte used to encode the sequencing. The first block will have two or more 1s followed by a 0, with the number of 1s indicating the number of bytes in the sequence. Subsequent blocks will have 10 as their most significant bits. The bits of the character are encoded in the remaining bits.

Character  Code Point  First Block  Second Block  Third Block  Fourth Block
A          U+0041      0x41         NA            NA           NA
Σ          U+03A3      0xCE         0xA3          NA           NA
😊         U+1F60A     0xF0         0x9F          0x98         0x8A

The bytes for U+1F60A are calculated by first determining the number of bits required to represent the character. 0x1F60A is 0b11111011000001010 which has 17 bits. 3 bytes provide only 16 character bits so 4 bytes are required. The value is padded with 0s to 21 bits and then slotted into the pattern 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx, giving 0xF0 0x9F 0x98 0x8A.
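The bit slotting can be sketched in Python. The function name `utf8_encode` is made up for this example, and it is a minimal sketch rather than a full encoder: it assumes the input is a valid code point and does not reject surrogates or values above U+10FFFF.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 by filling in the bit patterns."""
    if code_point < 0x80:
        # Up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


for cp in (0x41, 0x3A3, 0x1F60A):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", " ".join(f"0x{b:02X}" for b in encoded))
    # Cross-check against Python's built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
```

The loop prints the same bytes as the table for A, Σ and 😊.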

UTF-16 represents Unicode characters as 1 or 2 16-bit blocks. It was developed for compatibility with existing UCS-2 implementations. All UCS-2 characters are valid UTF-16 characters and require only 2 bytes. Additional characters are encoded using surrogate pairs. If a 16-bit block has a value in the range 0xD800 to 0xDBFF it is a leading or high surrogate and should be followed by a trailing or low surrogate in the range 0xDC00 to 0xDFFF. The character value is determined by subtracting the base surrogate value from each half of the pair, 0xD800 and 0xDC00 respectively, then combining the resulting values as two 10-bit chunks and adding 0x010000.

Character  Code Point  First Block  Second Block
A          U+0041      0x0041       NA
Σ          U+03A3      0x03A3       NA
😊         U+1F60A     0xD83D       0xDE0A

The surrogates for U+1F60A are determined by first subtracting 0x010000 from the value to get 0xF60A which, extended to 20 bits, is 0b00001111011000001010. Adding 0xDC00 to the least significant 10 bits gives 0xDE0A which is the low surrogate. Adding 0xD800 to the most significant 10 bits gives 0xD83D which is the high surrogate.
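The same arithmetic, in both directions, as a Python sketch. The function names `to_surrogates` and `from_surrogates` are made up for this example, and the code assumes its inputs are already in the valid ranges (a code point from U+10000 to U+10FFFF, or a well-formed surrogate pair).

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # most significant 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # least significant 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000


high, low = to_surrogates(0x1F60A)
print(f"0x{high:04X} 0x{low:04X}")            # 0xD83D 0xDE0A
print(f"U+{from_surrogates(high, low):04X}")  # U+1F60A
```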

UTF-32 represents every code point as a single 32-bit block, which is enough to directly represent all current Unicode characters. UTF-32 is identical to UCS-4 but named using the transform pattern to match the other UTF encoding schemes.

UTF-8 and UTF-16 are usually more space efficient than UTF-32 since the most common characters require only 1 or 2 bytes. They are also never less efficient, as a character uses at most 4 bytes. This space saving comes at the cost of complexity. With variable-width characters it’s no longer possible to find the number of characters in a string, or the Nth character, without reading through the string. Now that computers are more powerful and the transmission of data is more common, this trade-off is acceptable: it saves space without limiting the number of characters that can be represented by a single encoding.
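To put some numbers on the trade-off, here is a Python sketch that measures the three example characters in each encoding and then counts the characters in a UTF-8 byte string by scanning it. The helper names are made up for this example. The little-endian codec variants are used so that Python doesn't prepend a byte order mark to the output.

```python
def encoded_sizes(text):
    """Return the number of bytes text needs in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


def count_utf8_characters(data):
    """Count characters by scanning every byte of UTF-8 data.

    Continuation bytes look like 10xxxxxx, so every other byte
    starts a new character. Assumes data is valid UTF-8.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# A, Greek capital sigma, and the smiling face emoji.
for ch in ("A", "\u03A3", "\U0001F60A"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

data = "A\u03A3\U0001F60A".encode("utf-8")
print(len(data))                    # 7 bytes
print(count_utf8_characters(data))  # 3 characters
```

With UTF-32 the character count would simply be the byte length divided by 4, with no scan needed.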

2015-06-07 - Domestic Flights

Recently I had the chance to go on a short domestic flight. Most of the flights I’ve been on have been long international flights and by comparison this flight was amazing. A little bit of security at the start and then it’s just get on plane, get off plane, go home. No serious security officers scrutinising you. No customs agent prying at you. Nice, simple, and easy.

Even the flight itself was great. I personally enjoy flying, but there’s only so long that it feels comfortable to sit in a tiny seat. With this flight they had barely finished serving snacks when it was time to land.

Good times.