Design for Debuggability

There is a part of writing a Linux BSP that I dread profoundly, and it's among the most trivial ones. Specifically, I'm talking about that part where you've written a new device driver, or modified something in an old one, or you just need to configure it. You've added the right incantation in the device tree, you boot, and nothing happens. The module isn't probed, or your changes are silently ignored.

What follows is an afternoon wasted poking through the thick forest of device tree probing functions, until you discover you've hit some obscure corner case or missed a silly parameter.

Or have you ever booted a machine running systemd, and found out that a service wasn't started, even though its dependencies and preconditions look "right"?

You then waste hours figuring out exactly which part of the analysis that systemd-analyze critical-chain presents is wrong (and then agonizing over whether it's the analysis that's wrong, or if that's really a bug in systemd).

Another one. You have an entire system built over a message-passing infrastructure. The handler of one of these messages appears to be executed twice in very rapid succession, early at boot time, which would make it seem like it gets that message twice. But that message can be sent from hundreds of places in the code, by dozens of processes, including the one you're debugging. You can catch the handler, but who sent the message, and why?

All of these problems can be trivially solved through tracing, but it's hard to add that kind of tracing ad-hoc. I've done it once or twice for device tree code, back when the tooling available for it was far worse, and it took me a hell of a long time. That was on kernel 4.4, I think. So much has changed since I last looked at it that now, in 4.16, I might at best have a good idea about where to start, but no more.

Some systems, written by people with less free time on their hands, anticipate this sort of problem and try to solve it from the beginning. One particularly clever message-passing system I used had a "debug mode" which annotated each message's body with some useful information, including what process issued it and where from (this was done automatically), or what message it was sent in response to (but this required some developer assistance, in the form of a minimally-intrusive macro whose body was empty on production builds). Usually, if a spurious message was received, catching the handler was enough to tell you where the message originated.

There are two paradigms in the electronics industry that have been hugely influential in improving the reliability and quality of modern electronic systems: design for testing (DFT) and design for manufacturing (DFM). Their intent is probably clear from their names.

Similarly, although I don't suppose anyone consciously adhered to best practices derived from this fancy name, some of the best software that I've used or worked on seems to have been designed for debuggability (among other things). If something goes wrong, you don't need to jump through any hoops to figure out why.

This has many useful side-effects. Bugs don't linger in bug trackers because even difficult bugs can be tackled by people who aren't too familiar with the code and with how it evolved over the years. In community-run projects, they don't linger because fixing trivial bugs is easy enough, as opposed to consisting of a hopeless, two-week hunt for an elusive condition, followed by a three-line fix. New features are easier and faster to develop because the early bugs you cause are faster to trace.

Granted, keeping a program simple and readable gets you halfway to making it easy to debug. Some things that I've seen completing the bridge include:

  • Comprehensive, informative traces that are easy to turn on and off through a stable interface (guess which major operating system we all like gets this horribly and pointlessly wrong).
  • Good built-in introspection that's enforced through a simple interface (for example, using an explicit reference counting interface as opposed to relying on developers to manually do mystruct->creative_name_for_a_reference_counter++).
  • Loose module coupling, where most changes to a module's state ("local" state) are done through the module's code ("local" code), whose invocations are easy to trace to either user action or a small set of other modules.
  • A clean event-handling interface that lets you follow an event from detection to handling, even across multiple processes.
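To make the second point concrete, here's a sketch of what an explicit reference-counting interface buys you (the names `obj_get`/`obj_put` are mine, loosely modeled on the kernel's kref-style conventions; a real implementation would also need atomic operations). Because every acquire and release funnels through two functions, you get exactly one place to hook a tracepoint and one place to catch a double-put, which `mystruct->refcount++` scattered across the code can never give you:

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative only: no locking or atomics, unlike a real refcount. */
struct obj {
    int refcount;
    const char *name;
};

/* Every acquire goes through here, so a single tracepoint
 * (a printf, in this sketch) sees them all. */
static void obj_get(struct obj *o, const char *who)
{
    o->refcount++;
    fprintf(stderr, "obj %s: get by %s -> %d\n", o->name, who, o->refcount);
}

/* Returns 1 when the last reference is dropped and the caller may free. */
static int obj_put(struct obj *o, const char *who)
{
    assert(o->refcount > 0); /* catches a double-put immediately, at the culprit */
    o->refcount--;
    fprintf(stderr, "obj %s: put by %s -> %d\n", o->name, who, o->refcount);
    return o->refcount == 0;
}
```

When a leak or use-after-free shows up, the trace of get/put pairs with their `who` labels usually points at the offender directly, instead of leaving you to bisect every site that touched the counter by hand.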

In most, if not all, cases that I'm familiar with, this design was developed informally, as a result of nothing more than engineering experience and common sense. But then again, all good formal frameworks start out as good informal principles.

(Although, as ACPI has shown, not all good informal principles develop into good frameworks...)