On the importance of clear technical specifications

Even when the code is working like a charm, technical specifications – and their different interpretations by different people – can lead to confusion and hours-long debugging sessions.

I was recently working on a driver¹ for a piece of hardware that contained three sets of registers² that were very similar: the layout of the registers in each set was the same (bit for bit) but their meaning was slightly different. The documentation for the device described the registers in three separate places: the first contained a list of all the sets of registers and their relative addresses, the second contained a list of registers for each set and the third described each register in detail. In these three places, the order of the sets of registers was different.

This kind of detail can cause a fair bit of confusion: the way I had mapped the registers³ followed the second description, which, to me, was the clearest one of the three. The registers looked fine when I polled them⁴ but something was amiss with either the logic of my driver or the logic of the device.
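
To make the ordering problem concrete, here is a minimal sketch in C of what mapping three identically laid-out register sets might look like. Everything in it is an assumption made for illustration: the names (RegisterSet, DeviceRegisters, SET_A and friends) and the three fields are hypothetical and not taken from the actual device, and the caller is expected to supply the memory that stands in for the hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout shared, bit for bit, by all three register sets. */
typedef struct {
    volatile uint32_t control;
    volatile uint32_t status;
    volatile uint32_t value;
} RegisterSet;

/* The order of these enumerators encodes ONE reading of the specification.
 * If the hardware actually orders its sets differently, every access below
 * still compiles and still reads plausible data, just from the wrong set. */
enum { SET_A = 0, SET_B = 1, SET_C = 2, SET_COUNT = 3 };

/* The three sets sit back to back in the device's address window. */
typedef struct {
    RegisterSet sets[SET_COUNT];
} DeviceRegisters;

/* Returns the register set selected by `which` (one of SET_A..SET_C). */
static RegisterSet *register_set(DeviceRegisters *dev, int which)
{
    return &dev->sets[which];
}

/* Byte offset of a set relative to the start of the device window, useful
 * for comparing the mapping against the address table in the specification. */
static size_t register_set_offset(int which)
{
    return offsetof(DeviceRegisters, sets) + (size_t)which * sizeof(RegisterSet);
}
```

Because the three sets share one layout, nothing in the type system can catch a swapped order: register_set(dev, SET_B)->value is just as valid to the compiler as register_set(dev, SET_C)->value. Only the specification can say which one is right.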

In order for the device to work properly, one of the sets of registers had to have exactly the right value – it had to correspond to a different, otherwise unrelated register value in the same device. Both values were “moving targets”⁵ so my driver had to time reading the first and writing to the second exactly right. Of course, because I had picked the wrong order to follow, the device didn’t work properly.

Luckily, this wasn’t one of those devices that, when they don’t work properly, start to smell bad (devices do that when they short and start to burn) and the problem was easily fixed by moving a few lines of code around. We also found the problem pretty quickly: one phone call to the guy who wrote the technical specification for the device, and a hunch about what the problem might be, pointed to the issue immediately. Still, such problems are avoidable.

A similar problem occurred, on the same device, about a week earlier: the specification talked about “masking” a certain number of bits in its registers. The definition I have of “masking” a bit (i.e. applying an AND⁶ operator to it) was different from the definition the firmware developer had (he applied an XOR⁶ to the mask), so all my bits were the inverse of what they should be – at least as far as the control registers were concerned. Suffice it to say that didn’t work out of the box either.
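
The two readings of “masking” are easy to compare side by side. The following is a small sketch in C with hypothetical function names and made-up 8-bit values; it is not the device's actual register width or the firmware's actual code, only an illustration of how far apart the two interpretations land for the same inputs.

```c
#include <stdint.h>

/* Reading 1: "masking" keeps only the bits that are set in the mask (AND). */
static uint8_t mask_with_and(uint8_t value, uint8_t mask)
{
    return (uint8_t)(value & mask);
}

/* Reading 2: "masking" flips the bits that are set in the mask (XOR). */
static uint8_t mask_with_xor(uint8_t value, uint8_t mask)
{
    return (uint8_t)(value ^ mask);
}
```

With a value of 0xFF and a mask of 0x0F, the AND reading gives 0x0F while the XOR reading gives 0xF0: for that input, each result is exactly the bitwise inverse of the other. With a value of 0xA5 and the same mask, the results are 0x05 and 0xAA, which are simply different.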

Combine these problems with errors in the device’s schematics (signals that were indicated as normally high but were really normally low and vice versa) and we have a lot of confusion where everybody thinks they’re doing what they’re supposed to be doing, but the devices just don’t seem to agree.

A driver is a piece of software that talks directly to a piece of hardware and abstracts the function of that hardware for the operating system. For example, a driver for a hard disk provides an abstraction to the operating system that allows the OS to use the hard disk regardless of the way it needs to talk to the hard disk – because the driver does all the talking. ↩
A register (like the one I describe here – there are different sorts) is one of the ways software can talk to hardware. To the software, registers look like any other bit of memory, but unlike other bits of memory, the values are communicated directly to/from the device. That way, the precise way the computer talks to the device (e.g. through some kind of protocol, such as PCIe) is invisible to the software. ↩
In C and C++, we access variables through their names. Mapping registers basically means assigning a name to each register according to where they are in memory. ↩
Polling a register, in this case, consists of reading it with a special program that bypasses the driver’s normal logic and shows the contents of the registers to the developer – me, in this case. Such techniques can be very useful when debugging drivers. ↩
In that they change automatically and fairly quickly. ↩
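Applying an AND operator to a bit and a mask yields ‘true’ if both the bit and the mask are ‘true’, and ‘false’ otherwise. I.e. 1 AND 1 is 1, 1 AND 0 is 0, 0 AND 1 is 0, 0 AND 0 is 0. An XOR, by contrast, yields ‘true’ only if exactly one of the two is ‘true’: 1 XOR 1 is 0, 1 XOR 0 is 1, 0 XOR 1 is 1, 0 XOR 0 is 0. ↩ ↩²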