Not long ago, in 2009, mode confusion in a system was part of the chain of events that led to the loss of Air France Flight 447. The pilots reacted to a loss of altitude by pulling back on the stick, which would have been an appropriate reaction with the aircraft’s automation fully engaged: the flight controls would have put the aircraft into a climb while protecting it against a stall. However, the airplane’s systems had dropped into a mode of lesser automation (“alternate law” in Airbus terms) because of a blocked airspeed sensor, and that mode allowed the pilots to put the plane into a nose-high stall from which they did not recover.
We have come a long way in building reliable software and in the techniques we use to do it; still, systems fail all the time. What makes some systems more prone to failure than others?
We often attribute failure to complexity. That’s a fair answer, but the experience and evolution of software suggest there’s more to it. Having run large (in some cases, literally the largest), complex systems for more than a decade, I see one pattern repeatedly: failure modes, or modes in general. When not handled carefully, modes can make a system intrinsically unstable. Every system has failure modes, but the most common and nasty ones are introduced by bimodal behaviors.
In the book “The Better Angels of Our Nature”, Steven Pinker argues that, despite what the news tells us, we may be living in the most peaceful time in our species’ history. (Highly recommended if you haven’t read it.)
I know, it’s a cheesy, weird parallel to draw here (with system failures), but today we (systems operators) may be living in the most peaceful (i.e., least on-call pain) time in our species’ (systems’) history. That’s because years of academic research have gone into this very topic.
A mode is a distinct setting within a machine interface in which the same user input produces results different from those it would produce in other settings. For example, vi has one mode for inserting text and a separate mode for entering commands (sorry, Emacs users, but I’m sure you get the point). These are fairly benign modes that you deal with every day, and they are, at most, a nuisance for beginners.
However, there are modes that can cause actual production downtime, and you may recognize some of them in your own systems.
Take these failure modes seriously. Bimodal/fallback behaviors are harder to test: the fallback path or secondary mode gets exercised less and less over time. Your primary mode will keep getting more resilient, but the day the fallback behavior kicks in and turns out to have latent issues, your system’s availability will be at risk and you will have nasty outages.
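To make this concrete, here is a minimal sketch (in Go, with hypothetical names; none of this comes from a specific system mentioned above) of a read path that silently falls back from a cache to a database. The fallback branch only runs when the cache is already unhealthy, which is exactly when you least want to discover its latent bugs or the extra load it puts on the database.

```go
package main

import (
	"errors"
	"fmt"
)

// errCacheDown stands in for any failure of the primary dependency.
var errCacheDown = errors.New("cache unavailable")

// getFromCache and getFromDatabase are hypothetical stand-ins for real clients.
func getFromCache(key string) (string, error)    { return "", errCacheDown }
func getFromDatabase(key string) (string, error) { return "value-for-" + key, nil }

// Get is bimodal: in the common case it reads from the cache, but when the
// cache fails it silently switches to a secondary mode and reads the database.
// That secondary path is almost never exercised (or load-tested) in normal
// operation, so its latent issues only surface during an incident.
func Get(key string) (string, error) {
	if v, err := getFromCache(key); err == nil {
		return v, nil
	}
	// Secondary mode: rarely runs in production, rarely tested.
	return getFromDatabase(key)
}

func main() {
	v, err := Get("user:42")
	fmt.Println(v, err)
}
```

A system like this effectively has two code paths with very different performance and correctness characteristics, and only one of them gets continuous real-world exercise.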
Here are some ways to avoid bimodal behaviors in the examples I shared above:
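One recurring alternative (a hedged sketch of the general idea, not a prescription tied to any specific example above) is to collapse the two modes into one: always go through the same path and fail cleanly when a dependency is unavailable, so the error is explicit, predictable, and exercised in testing. Continuing the hypothetical Go example:

```go
package main

import (
	"errors"
	"fmt"
)

// errCacheDown stands in for any failure of the single dependency on the read path.
var errCacheDown = errors.New("cache unavailable")

// getFromCache is a hypothetical stand-in for a real cache client.
func getFromCache(key string) (string, error) { return "", errCacheDown }

// Get has a single mode: it always reads through the cache and fails cleanly
// when the cache is unavailable. There is no rarely exercised secondary path;
// callers get an explicit, predictable error and decide how to degrade.
func Get(key string) (string, error) {
	v, err := getFromCache(key)
	if err != nil {
		return "", fmt.Errorf("get %q: %w", key, err)
	}
	return v, nil
}

func main() {
	if _, err := Get("user:42"); err != nil {
		fmt.Println("failing cleanly:", err)
	}
}
```

Whether failing cleanly, shedding load, or serving a degraded result is the right choice depends on the system, but the common thread is that there is only one mode to reason about, test, and operate.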
Avoid bimodal behaviors when building systems. Know your failure modes. Fail cleanly and predictably. It’s a simple concept that will bring more “peace” to running systems.