On Tuesday, January 10th, the US Federal Aviation Administration's (FAA) Notice to Air Missions, or NOTAM, system went down overnight. As a direct result, no planes took off overnight or for most of the next morning; the FAA restarted flights by 10 AM.
That seems like kind of a big deal.
The system is designed to warn pilots of abnormalities on their flight plan. So, for example, if you are flying on instruments in bad weather from Frederick, Maryland, to Cumberland, it might tell you a runway light was out, or that a runway was closed due to construction. Runways are, after all, made of asphalt or concrete, and need to be repaved. Planes that have engine trouble can get stuck on them, and you don’t want to land there — hence NOTAM. No NOTAM, no takeoff, for all but the simplest of short-flight, small-airport, visual-rules pilots. Even then, when the FAA says every plane is grounded, you do not take off; the last grounding of this magnitude was September 11th, 2001.
NOTAM is a little system that tells you a beacon is out, yet it took down the entire aviation industry. This reminds me of an old poem, “For Want of a Nail”:
For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the message was lost.
For want of a message the battle was lost.
For want of a battle the kingdom was lost.
As to why this happened, let’s talk about values in software.
Reliability vs. Resilience
It was a different century – the 20th – when I graduated from college and took my first job interview. The company was Perdue Farms, and they still had an old IBM mainframe running the show. Packaging, which the software managed, was the bottleneck for delivery. When the system was down, it could not calculate the weight of the chicken and print a label. The company had done the math: every hour the system was down cost something like twenty thousand dollars. Back then, that was a lot of money.
The solution to that problem was to design for reliability. That is, the system could just never go down. The mainframe used a giant hard drive with huge platters, so thick that a hard drive read would only be wrong once in a hundred years. The box that housed the mainframe was big, heavy, ugly, and black.
The alternative to reliability is resilience. That is, instead of asking how we make sure the system never fails, we shift our emphasis to how we can recover quickly when failure happens. In hardware, we’ve done this for decades. Even during my interview, I suggested a Redundant Array of Independent Disks (RAID). A RAID system might have ten very cheap hard drives. Every time data is sent to the array, the controller writes it to at least two of the disks, perhaps three. When one fails, the controller can just read from the others, and a physical LED lights up. A human takes the drive out like we used to take out an audio cassette, puts a new drive in, clicks a button, and the controller starts to use it. As long as those drives are one-thousandth the price, they can fail a hundred times as often and Perdue Farms would still save money.
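The mirroring idea is simple enough to sketch in a few lines of code. This is a toy illustration of the write-twice, read-from-a-survivor principle, not real controller firmware; all of the class and function names below are invented:

```python
# Toy sketch of RAID-1-style mirroring: every write goes to every mirror,
# and a read falls back to the next disk when one has failed.

class Disk:
    def __init__(self):
        self.blocks = {}
        self.failed = False

    def write(self, addr, data):
        if not self.failed:
            self.blocks[addr] = data

    def read(self, addr):
        if self.failed:
            raise IOError("disk failed")
        return self.blocks[addr]

class MirrorController:
    def __init__(self, disks):
        self.disks = disks

    def write(self, addr, data):
        for d in self.disks:          # write to every mirror
            d.write(addr, data)

    def read(self, addr):
        for d in self.disks:          # read from the first healthy disk
            try:
                return d.read(addr)
            except IOError:
                continue              # resilience: try the next mirror
        raise IOError("all mirrors failed")

controller = MirrorController([Disk(), Disk()])
controller.write(0, b"chicken label #1")
controller.disks[0].failed = True       # one cheap drive dies...
print(controller.read(0))               # ...the data is still readable
```

The point is that no single component is expected to be perfect; the system as a whole keeps working while a human swaps the dead drive.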
Let’s talk about the FAA’s NOTAM system.
Reliability and Math
As a cadet in Civil Air Patrol in high school, I was familiar with checking NOTAMs for clues on search and rescue missions. From what I can tell, it is the same system. That makes NOTAM perhaps sixty years old. I mean ollld. The text might come through your web browser or phone today, but it still uses the same antiquated codes that once ran over a teletype, from the days when telegram space was expensive. If NOTAM were designed to fail once every hundred years, it would have around a sixty percent chance of having already failed by now. The exact failure doesn’t matter; in this case it looks like a corrupted database file.
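That back-of-the-envelope figure is easy to check. The simple linear reading — sixty years of life against a hundred-year failure target — gives the sixty percent above; the standard constant-failure-rate (exponential) model is a little kinder. Both are rough sketches, and the ages here are the article's estimates, not official figures:

```python
import math

MTBF_YEARS = 100   # design target: one failure per hundred years
AGE_YEARS = 60     # rough age of the NOTAM system, per the estimate above

# Linear reading: expected number of failures accumulated so far
naive = AGE_YEARS / MTBF_YEARS

# Exponential model: probability of at least one failure by now
exponential = 1 - math.exp(-AGE_YEARS / MTBF_YEARS)

print(f"expected failures so far: {naive:.1f}")
print(f"chance of at least one failure: {exponential:.0%}")
```

Either way, by its own design target the system was overdue for exactly this kind of day.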
Meanwhile, did I mention that NOTAM was old? Here’s an example of what a pilot actually gets from the system. A former National Transportation Safety Board Chair once described them as “a bunch of garbage that nobody pays attention to.”
So we have the whole aviation system taken down by the loss of a software system of questionable value.
Even if we agree that it was best to shut down aviation over what transportation secretary Pete Buttigieg called an “abundance of caution” – why did it need to stay down so long?
Notification. Backup. Restoration. Reboot. Cutover. We’ve talked about these things in software for thirty years, and in hardware for much longer. More than that, there is a large number of newer software engineering techniques — blue/green deploys, feature flags, high availability, canary testing, notifications on failure — that are just day-to-day reality for teams that build this way from the ground up. We are even starting to see, to a limited extent, AI intervention on failure. Today that last category is more promise than reality. Still, an AI support program might be able to identify a database lock caused by a race condition between two queries and resolve it by killing them both. Together, these techniques enable continuous delivery of features, instead of delays, reviews, and extended testing cycles.
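Of the techniques above, feature flags are the easiest to sketch. The idea: risky new code ships behind a flag that can be flipped at runtime, so recovering from a bad release is a configuration change rather than a redeploy. The flag store and label-printing functions below are invented for illustration; in production the flags would live in a configuration service, not a dictionary:

```python
# Minimal feature-flag sketch: the new code path ships behind a flag,
# and the old path stays in place as an instant fallback.

FLAGS = {"new_label_printer": True}    # illustrative; real systems use a config service

def print_label_v1(weight):
    return f"WEIGHT {weight} LB"       # the proven, boring code path

def print_label_v2(weight):
    return f"Net Wt. {weight:.2f} lb"  # the new, riskier code path

def print_label(weight):
    if FLAGS.get("new_label_printer"):
        return print_label_v2(weight)
    return print_label_v1(weight)      # fallback when the flag is off

print(print_label(4.0))                # flag on: new behavior
FLAGS["new_label_printer"] = False     # failure detected: flip the flag
print(print_label(4.0))                # instantly back on the old path
```

Recovery here takes one line, not an overnight outage — which is the whole argument for designing around resilience rather than hoping for perfect reliability.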
We’ve been doing this at Excelon, while telling others about it, for a decade now. The ideas have infiltrated Silicon Valley culture; new startups just build this way. Established companies may wait until they have a cascading failure like the FAA. Even then, the solution may be a total rewrite that never gets done when a refactor would be more appropriate.
There are two things to talk about here. One is how to design software so that it can recover from failure. The other is how to design human systems so that a single software failure doesn’t stop a million people from flying today.