2014-05-03 - Character Encoding: ASCII
Character Encoding
- Part 1 - ASCII
- Part 2 - Extended ASCII
- Part 3 - UCS-2/UCS-4
- Part 4 - Endianness
- Part 5 - UTF-8/UTF-16/UTF-32
- Part 6 - Conclusion
Computers only really understand numbers. They take numbers as input, perform operations based on numbers, and produce numbers as output. This creates a problem because people don’t really understand numbers. Computers were developed primarily so that people wouldn’t have to deal with numbers. People tend to work better using a wide range of characters, of which digits are only a small subset. This means that there is a need for a way of encoding characters as numbers so that computers and people can understand each other.
Enter the American Standard Code for Information Interchange (often abbreviated as ASCII). ASCII is a seven-bit character encoding scheme created in the 1960s and inspired by earlier Teletype encoding schemes. It has since become very popular for use with computers. ASCII groups related characters together and orders them in a meaningful fashion where appropriate. This makes certain operations very simple and aids in sorting. For example, the digit '0' is encoded in ASCII with a value of 48, and the digits '1'-'9' occupy values 49 to 57, so for a character known to be a digit, its numeric value can be determined by subtracting 48. The lowercase 'a' has a value of 97 while the uppercase 'A' has a value of 65. Since both lowercase and uppercase letters are ordered in the same manner and without breaks, converting between them is just the addition or subtraction of 32.
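The arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names `digit_value`, `to_upper`, and `to_lower` are made up for this example, and real programs would normally use the language's built-in conversions instead.

```python
def digit_value(ch: str) -> int:
    """Return the numeric value of a single ASCII digit character."""
    if len(ch) != 1 or not ("0" <= ch <= "9"):
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - 48  # '0' is encoded as 48


def to_upper(ch: str) -> str:
    """Convert an ASCII lowercase letter to uppercase by subtracting 32."""
    if len(ch) == 1 and "a" <= ch <= "z":
        return chr(ord(ch) - 32)  # 'a' (97) - 32 == 'A' (65)
    return ch  # anything else is returned unchanged


def to_lower(ch: str) -> str:
    """Convert an ASCII uppercase letter to lowercase by adding 32."""
    if len(ch) == 1 and "A" <= ch <= "Z":
        return chr(ord(ch) + 32)  # 'A' (65) + 32 == 'a' (97)
    return ch  # anything else is returned unchanged


print(digit_value("7"))
print(to_upper("q"), to_lower("Q"))
```

The range checks matter: subtracting 32 from a character that is not a lowercase letter would land on an unrelated symbol or control character.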
Decimal | Hex | Binary | Description
---|---|---|---
0-31, 127 | 0x00-0x1F, 0x7F | 000 0000-001 1111, 111 1111 | Control characters: This includes some formatting characters like backspace, tab, and line feed as well as some obsolete Teletype characters like Bell.
32-47, 58-64, 91-96, 123-126 | 0x20-0x2F, 0x3A-0x40, 0x5B-0x60, 0x7B-0x7E | 010 0000-010 1111, 011 1010-100 0000, 101 1011-110 0000, 111 1011-111 1110 | Symbols: This includes punctuation, the space, and other symbols such as the ampersand, brackets, and slashes.
48-57 | 0x30-0x39 | 011 0000-011 1001 | Digits: '0' to '9'
65-90 | 0x41-0x5A | 100 0001-101 1010 | Uppercase Letters: 'A' to 'Z'
97-122 | 0x61-0x7A | 110 0001-111 1010 | Lowercase Letters: 'a' to 'z'
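As a rough illustration of how the ranges in the table partition the 128 code points, the sketch below maps a code to its group. The function name `classify` and the group labels are invented for this example and are not part of any standard.

```python
def classify(code: int) -> str:
    """Return the table group that a 7-bit ASCII code belongs to."""
    if not 0 <= code <= 127:
        raise ValueError(f"not a 7-bit ASCII code: {code}")
    if code <= 31 or code == 127:
        return "control"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "symbol"  # everything left over: space, punctuation, brackets, etc.


# Print decimal, hex, and 7-bit binary for a few sample characters.
for ch in "\t A7a~":
    code = ord(ch)
    print(f"{code:3d}  0x{code:02X}  {code:07b}  {classify(code)}")
```

Because the symbol ranges are scattered between the other groups, it is simplest to treat them as whatever remains once the contiguous ranges have been checked.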
As the name implies, ASCII was created primarily for American English use. It covers all the standard American English characters but has no room for accented or non-Latin characters. This means that extensions are required for non-American and non-English use. Much of the character encoding development since ASCII has been aimed at supporting these additional characters.
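The limitation is easy to see in practice. The following sketch, with a hypothetical helper named `is_ascii`, checks whether every character in a string fits in seven bits, and shows Python's built-in `ascii` codec refusing an accented word.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character fits in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("resume"))
print(is_ascii("résumé"))

# Encoding text with characters outside the range fails outright.
try:
    "résumé".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode as ASCII:", err.reason)
```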