2021-04-17 - DataTypes: Bits
Binary digits, or bits, are the simplest data type used by computers. A bit can have a value of either 0 or 1, and all digital data is based on them. How the data is actually stored depends on what you are storing it on. Inside of a computer bits are stored and transmitted using voltage levels. The actual voltages and which state represents which value are system and situation dependent, but in all cases there are two states and one state is a 0 while the other is a 1. Hard drives, tape drives and floppy disks use magnetic polarity to encode bits. Optical media like CDs and DVDs use pits and the absence of pits to encode bits. As long as you have something that can have one of two states it can be used to store or transmit a bit.
A bit on its own isn’t that useful, as it only has two values, so in most cases you have a series of bits. The combination of the states of these bits is used to encode data using a variety of formats. The number of possible states is calculated as 2 to the power of the number of bits you have. If you have 1 bit that’s 2 to the power of 1 or 2 states (0, 1). If you have 4 bits that’s 2 to the power of 4 or 16 states (0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111). What meaning you give to these states depends on what you are using them to represent. We’ll get more into that in later parts. For now I want to talk about terms for groupings of bits.
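As a quick sketch of the counting rule above, here is a small Python snippet that lists every state for a given number of bits. The function name `bit_states` is just an illustrative choice.

```python
from itertools import product


def bit_states(n_bits):
    """Return every combination of n_bits bits as a list of strings."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


states = bit_states(4)
print(len(states))  # 16, which is 2 ** 4
print(states[:4])   # ['0000', '0001', '0010', '0011']
```

Each extra bit doubles the length of the list, which is where the 2 to the power of n comes from.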
Bytes
The meaning of a byte is determined by the system you are using, but typically it’s the number of bits required to store a single character on the system and/or the minimum addressable number of bits. On modern computers a byte is almost always 8 bits, but other systems may use different values. For example, a large number of mainframe computers had 6-bit characters and so they used 6-bit bytes. The 8-bit byte comes from character encodings such as ASCII, a 7-bit code that is usually stored in 8 bits, and from the use of 8-bit CPUs in early microcomputers.
The unambiguous term for 8 bits is an octet.
Words
Again, the meaning of a word is determined by the system, but it is typically the native size of the registers (the single-value memory locations inside of the CPU). Usually, but not always, this is also the size of the data bus (the circuit paths coming from the CPU used to send and receive data) and the size of the address bus (the circuit paths coming from the CPU used to specify which memory location is being written to or read from). For example, modern 64-bit CPUs have 64-bit registers, excluding large multi-value registers, and work with 64-bit addresses and data, although the physical address bus is often narrower than that. The match between these sizes isn’t universal either: the Intel 8088 used in the original IBM PC has 16-bit registers but an 8-bit data bus and a 20-bit address bus.
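To put a number on what the width of an address bus means, here is a minimal sketch. It assumes each address selects exactly one byte, which is the case for the 8088 but not for every system, and the function name is made up for this example.

```python
def addressable_bytes(address_bits):
    """Number of distinct addresses an address bus of this width can select.

    Assumes one byte per address (true for the 8088, not for every system).
    """
    return 2 ** address_bits


print(addressable_bytes(20))             # 1048576
print(addressable_bytes(20) // 2 ** 20)  # 1, i.e. 1 MiB
print(addressable_bytes(16))             # 65536
```

So the 20-bit address bus on the 8088 can reach 1 MiB of memory even though a single 16-bit register can only count up to 65,535.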
The meaning of a word can also be determined by the software environment you are running. For example, in Windows development a WORD is always 16 bits, even on 64-bit versions of the operating system. This is because Windows started as a 16-bit OS and, to maintain backwards compatibility, the meaning hasn’t been updated.
Larger
Larger collections of bits are usually specified using prefixes although this can be confusing as historically two prefix schemes have been used.
The SI unit system uses a set of prefixes corresponding to powers of 10. k or kilo means 10^3 or 1000, M or Mega means 10^6 or 1,000,000, G or Giga means 10^9 or 1,000,000,000 etc. These prefixes with the standard meanings have been used for collections of bits and bytes, but often a binary version is used. In that version k = 2^10 or 1024, M = 2^20 or 1,048,576, G = 2^30 or 1,073,741,824 etc. Note that these values are close to, but not the same as, their decimal counterparts. This can lead to confusion: for example, hard drive manufacturers often report sizes using decimal prefixes while Windows reports them using binary prefixes. This is how a 250 GB hard drive can turn into a 232 GB drive.
To deal with this confusion an alternative prefix system has been developed that is exclusively binary. Ki or kibi means 2^10, Mi or mebi means 2^20, Gi or gibi means 2^30 etc. This system is slowly catching on as it removes confusion, but it’s nowhere near universal.
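To check the hard drive numbers above, here is a small sketch of the conversion. The constant and function names are my own choices for the example.

```python
GB = 10 ** 9   # decimal gigabyte
GIB = 2 ** 30  # binary gibibyte


def decimal_gb_to_gib(size_gb):
    """Convert a size in decimal gigabytes to binary gibibytes."""
    return size_gb * GB / GIB


print(round(decimal_gb_to_gib(250), 2))  # 232.83
```

The drive hasn't shrunk; the same number of bytes is just being divided by a bigger unit.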
When using abbreviated units a lowercase b means bits and an uppercase B means bytes. So a MiB is a mebibyte while a Mib is a mebibit. You can multiply or divide by the size of a byte on your current system to convert between them.
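A minimal sketch of that conversion, assuming the usual 8-bit byte (the default can be overridden for systems with a different byte size, and the function names are only illustrative):

```python
BITS_PER_BYTE = 8  # assumption: 8-bit bytes, as on most modern systems


def mebibytes_to_mebibits(mebibytes, bits_per_byte=BITS_PER_BYTE):
    """Convert MiB to Mib by multiplying by the size of a byte."""
    return mebibytes * bits_per_byte


def mebibits_to_mebibytes(mebibits, bits_per_byte=BITS_PER_BYTE):
    """Convert Mib to MiB by dividing by the size of a byte."""
    return mebibits / bits_per_byte


print(mebibytes_to_mebibits(1))    # 8
print(mebibits_to_mebibytes(100))  # 12.5
```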
2021-03-20 - Adventures in Partitions
No, this post isn't about Poland
I've always been fascinated by the history of programming and to that end I recently bought myself an old computer. I installed Windows 98 SE, Windows NT 4.0, Windows 3.1 and OS/2 2.1 on it and installed a variety of programming packages. Previously I was using virtual machines to host the OSs, but I found the virtual screen difficult to read and the virtualization program has compatibility issues with Windows updates. Having dedicated hardware means things can run full screen and I shouldn’t have to deal with updates.
It's been an interesting experience setting up all these OSs. For one thing I learned things I didn't expect to learn, and for another, things I expected to be problems weren't. My primary concern with setting up this system was drivers. I envisioned days spent trying to get things to work and googling obscure error messages, but that hasn't really been the case. For the most part things just worked and I was able to find drivers for the things I wanted. Dell had driver downloads for both Windows 98 and Windows NT 4.0, and I even found USB mass storage drivers for both of those as well. I also found a tool that patches the SVGA driver for Windows 3.1 so that you can run it at a resolution above 640x480. Some things are still missing. For example, Windows 98 SE can't read NTFS partitions, but Windows NT 4.0 can read FAT32 and both can connect to the network and read USB sticks, so that’s not a huge issue. When I get around to working with OS/2 I want to try and figure out how to get it to read the CD drive and display at a higher resolution, but those issues don’t stop it from working.
What I did have a problem with is hard drive partitions.
Firstly, do you know how computers boot from a hard drive? Well, it turns out that it's a three-step process. First you have the Master Boot Record (MBR), which sits at the start of the hard drive. The computer executes this section first and it loads partition information and passes execution off to some other bit of code. The actual operations performed depend on the MBR installed. A basic one will just find an active partition and execute the Volume Boot Record (VBR), while a more advanced one will switch over to a boot manager program. The VBR works the same as the MBR but for a partition, and it is more OS specific. The VBR locates, loads and starts the actual OS.
The other thing I learned about was how partitions are defined and how the computer requests data from them. It turns out that the MBR has space for four partition slots which are stored after the start up code. These partition slots contain information about where the partition is on the disk, how big it is, and what kind of partition it is. This limits the maximum number of primary partitions on a disk to four. You can have extended partitions which are basically partitions containing other partitions but those caused me issues so I never used them. Newer hard drive setups replace the MBR with something more expandable but that's not really relevant to this old computer.
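To make the four partition slots a bit more concrete, here is a rough sketch that reads them out of a 512-byte MBR sector. It relies on the classic MBR layout (446 bytes of start-up code, four 16-byte entries, then the 0x55 0xAA signature). The demo sector is made up for illustration and is not read from the drive in this post, and the function names are my own.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446  # the start-up code occupies the bytes before this
ENTRY_SIZE = 16
# status, CHS of first sector, type, CHS of last sector, first LBA, sector count
ENTRY_FORMAT = "<B3sB3sII"


def parse_partition_table(mbr):
    """Return the four partition slots of an MBR sector as dictionaries."""
    if len(mbr) != SECTOR_SIZE or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    slots = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        status, _first, ptype, _last, lba_start, sectors = struct.unpack(
            ENTRY_FORMAT, mbr[start:start + ENTRY_SIZE]
        )
        slots.append({
            "active": status == 0x80,  # 0x80 marks the bootable partition
            "type": ptype,             # 0x00 means the slot is unused
            "lba_start": lba_start,
            "sectors": sectors,
        })
    return slots


def make_demo_mbr():
    """Build a made-up MBR with one active 2 GiB FAT16 (type 0x06) partition."""
    mbr = bytearray(SECTOR_SIZE)
    entry = struct.pack(ENTRY_FORMAT, 0x80, b"\x00" * 3, 0x06, b"\x00" * 3,
                        63, 4194304)
    mbr[PARTITION_TABLE_OFFSET:PARTITION_TABLE_OFFSET + ENTRY_SIZE] = entry
    mbr[510:512] = b"\x55\xaa"
    return bytes(mbr)


for slot in parse_partition_table(make_demo_mbr()):
    print(slot)
```

Because the table has a fixed size, there is simply no room for a fifth entry, which is why extended partitions exist.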
Now on to accessing data. Originally data was accessed on a hard drive using cylinder-head-sector (CHS) addressing. Hard drives are made up of a stack of platters. CHS forms a kind of 3D coordinate system for locating data on these platters. The Head value is a vertical coordinate and selects which platter and which side of the platter to get data from. Head is the term for the component that reads the data from the platter, so by selecting which head to use you select which platter surface to read from. The Cylinder or Track value is a radial value which indicates a ring on the platter to get data from. The Sector value is an angular value which indicates which section of the ring the data is in. This system was used because early hard drives were rather simple and so the computer had to tell them exactly where to find the data it wanted. As hard drives got more advanced, and specifically as they got more built-in controller logic, this scheme was less necessary. CHS was eventually replaced by logical block addressing (LBA), which accesses data on a hard drive using a single numerical index and leaves it up to the hard drive itself to figure out where that block of data actually is.
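The two schemes can be mapped onto each other with the conventional conversion formula. This is a sketch with an assumed geometry of 16 heads and 63 sectors per track; a real drive reports its own values, and the function names are just for illustration.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Convert a CHS address to a block index. Sectors are numbered from 1."""
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector numbers run from 1 to sectors_per_track")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Convert a block index back to a (cylinder, head, sector) tuple."""
    cylinder, remainder = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1


HEADS = 16    # assumed geometry for the example
SECTORS = 63

print(chs_to_lba(0, 0, 1, HEADS, SECTORS))  # 0, the very first sector
print(chs_to_lba(0, 1, 1, HEADS, SECTORS))  # 63, first sector under the next head
print(chs_to_lba(1, 0, 1, HEADS, SECTORS))  # 1008, first sector of the next cylinder
print(lba_to_chs(1008, HEADS, SECTORS))     # (1, 0, 1)
```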
This is important because the format you have for encoding these addresses determines how large of a hard drive you can access. The original IBM BIOS implementation of CHS had 10 bits for cylinder, 8 bits for head, and 6 bits for sector. With 512-byte sectors this gives 8064 MiB (63 sectors x 1024 cylinders x 256 heads x 512 bytes) of addressable space. There are only 63 sectors in a track because sector numbering starts at 1. This was replaced by 28-bit LBA, which allows for 268,435,456 sectors or 128 GiB, and later by 48-bit LBA, which supports up to 128 PiB. There is one more wrinkle though, because the MBR only has 4 bytes to store the size of a partition. If we are using 28-bit LBA that's fine, but with 48-bit LBA we lose 16 bits, which limits the maximum number of sectors in a partition to 4,294,967,296 or 2 TiB.
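The limits above are all plain arithmetic, so here is a short sketch that reproduces them, assuming 512-byte sectors throughout. The variable names are my own.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors
MIB, GIB, TIB, PIB = 2 ** 20, 2 ** 30, 2 ** 40, 2 ** 50

# CHS: 10 bits of cylinder, 8 bits of head, 6 bits of sector (numbered 1 to 63)
chs_bytes = 1024 * 256 * 63 * SECTOR_BYTES
# LBA: the number of address bits sets the number of sectors
lba28_bytes = 2 ** 28 * SECTOR_BYTES
lba48_bytes = 2 ** 48 * SECTOR_BYTES
# MBR: 4 bytes (32 bits) for the sector count of a partition
mbr_partition_bytes = 2 ** 32 * SECTOR_BYTES

print(chs_bytes // MIB)            # 8064 (MiB)
print(lba28_bytes // GIB)          # 128 (GiB)
print(lba48_bytes // PIB)          # 128 (PiB)
print(mbr_partition_bytes // TIB)  # 2 (TiB)
```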
The hard drive installed in the computer is 232 GiB (250 GB), but the BIOS and the partition manager I was using only see it as 128 GiB, likely because they are using 28-bit LBA. FDISK for Windows 98 reported the drive as only being 65,535 MiB, but that’s likely because it’s using a 16-bit value somewhere. Windows NT 4.0 reported the drive as being 8064 MiB, likely because it was using CHS. The other problem with the Windows NT 4.0 setup program is that it can only create 4 GiB NTFS partitions because it first creates them as super-sized FAT 16 partitions for some reason. The OS itself can create larger partitions but those have to be created after you have it installed. There’s also apparently a bug where the main NT OS files have to be within the first 8064 MiB of the drive or the loader can’t find them. DOS and Windows 3.1 were surprisingly easy to set up. FAT 16 partitions as implemented by them can only be 2 GiB, so I created a partition of that size and they happily installed into it. I tried the same for OS/2 but it saw the partition as only being 32 MiB for some reason and got really confused about the other partitions. I ended up having to let it create its own 32 MiB partition and then expanded it to 2 GiB afterwards. It seems to be okay with that.
But now I can program in C, C++, QuickBasic, Visual Basic, ASP, Pascal and Assembly, so that’s nice.
2021-02-20 - In IL: Assemblies
So far we've mostly been looking at instructions. Instructions form the smallest part of a program, but you can't execute a random IL instruction on its own. To see how instructions fit together we need to pull up and start looking at things from the outside in. To start with we are going to look at assemblies.
A .NET program can be thought of as a collection of assemblies. Assemblies are individual files, either executable (.exe) files or library (.dll) files, that each contain a collection of types, methods, and data. We'll get to all that in a bit but first let's look at the Assembly information contained within an executable. To do this we're going to go back to part 5 and take a closer look at the compiled code. To refresh your memory here's the C# program from that part.
Now we're going to compile this program and then look at the decompiled file but instead of looking at the contents of the main method we're going to look at the information added before the class is declared.
We have two assembly declarations here. The extern declaration is used to indicate a referenced assembly. In this case the program references mscorlib, which is where all the basic types and methods are declared. The second declaration describes the assembly we built. You can see that the assembly directive contains a bunch of attributes that describe the assembly itself, such as its name, the version of .NET it's built for and its version. Some of these are set based on the build options of the project and some are based on the values in AssemblyInfo.cs.
Finally we have a module declaration. Assemblies are built from a collection of modules, which can be thought of as files, although these don't seem to map exactly to source files. It's likely that Visual Studio does some work to combine all the source files before actually building the assembly. There are also some other directives, such as the .subsystem directive, which says whether this is a graphical application or a console application. These describe how the assembly was built and how it's meant to be run.
Now there are a lot of things that could be talked about with assemblies, but I'm going to hold off on that for now as they aren't directly connected to the code we write. I might come back and explore the options more in the future.
Next time we will start looking at class declarations.
2021-01-23 - DataTypes: Introduction
This series is going to look at how data is stored in a computer. We're going to start with simple things like numbers and references but eventually we will work up to more complicated data structures like arrays and queues.
My goal with the first part of the series is to get a solid grasp on how computers store simple values and touch a bit on how operations on those values are performed.
The second part will involve some more code and examples of simple versions of more complex combinations of values. These combinations of values form the basis for many of the containers that we use as programmers so I want to cement the understanding of the basics.