Are you fooling yourself about the root cause of your bug?
At the start of my last company, we had an obscure bug that would periodically bring down an entire data center for a large and popular Internet site. The bug was hard to find, in code that had successfully processed billions of transactions, and only happened at night. Worse, it occasionally caused a total outage. We knew we had to find and fix the problem right the first time so we didn’t lose a critical early customer.
Here’s how we found the bug and saved the company.
To save our opportunity with this customer, and likely the company itself, we had only one chance to fix the bug. That meant no guessing: we needed a disciplined, scientific approach.
The first part of this approach was to look at our evidence: core dumps, backtraces, log messages, packet data, etc., and formulate a hypothesis about the source of our bug. This would be our root cause hypothesis.
A critical idea in science is the notion of falsifiability. For certain kinds of statements, such as “all swans are white”, a single piece of counter-evidence is enough to disprove the hypothesis. That’s it, just one. If I see one black swan, I need to reformulate my hypothesis.
For our bug, if just one piece of data wasn’t explained by our root cause hypothesis, we might not have the right culprit… or at least not all of the culprits. And that would mean more potential outages. We’d either have to prove that the unexplained data was spurious, or refine our root cause hypothesis to explain it.
Our company, LineRate Systems, built high-performance, software-only load balancers. During a proof of concept with that popular site, things were going very well.
Every morning, the customer’s operations team would put the load balancer into production and scale it out to more and more of the data center, until we were operating at over 10 Gb/s of production traffic in front of hundreds of web servers. Exciting stuff. We’d finally built up enough confidence with the customer that it was time for some continuous testing, with overnight runs.
That’s when the problems began.
One night, all load balancing processes got stuck, consuming 100% of the CPU. A major outage, an angry customer, and a demoralizing setback.
We were back to daytime-only testing. We instrumented the system to log when the failure happened, and added a watchdog to dump core for the offending process so that we could analyze the failure. In the meantime, a new process was started to keep the system running.
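To make that setup concrete, here is a minimal sketch of a heartbeat-style watchdog. It is my own reconstruction, not LineRate’s actual instrumentation: the heartbeat pipe, the five-second timeout, and the use of SIGABRT to force a core dump are all assumptions chosen for illustration.

```c
/* Illustrative only: a heartbeat watchdog that forces a core dump
 * from a hung worker and then starts a replacement process. */
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define HEARTBEAT_TIMEOUT_SEC 5   /* silence longer than this == hung */

static void worker(int hb_fd)
{
    for (;;) {
        /* ... the real load-balancing work would happen here ... */
        char beat = '.';
        (void)write(hb_fd, &beat, 1);   /* tell the watchdog we're alive */
        sleep(1);
    }
}

int main(void)
{
    for (;;) {                    /* restart loop keeps the service up */
        int hb[2];
        if (pipe(hb) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }
        if (pid == 0) {           /* child: the monitored process */
            close(hb[0]);
            worker(hb[1]);
            _exit(0);
        }
        close(hb[1]);

        for (;;) {                /* parent: watch for heartbeats */
            fd_set rd;
            FD_ZERO(&rd);
            FD_SET(hb[0], &rd);
            struct timeval tv = { HEARTBEAT_TIMEOUT_SEC, 0 };

            char beat;
            if (select(hb[0] + 1, &rd, NULL, NULL, &tv) > 0 &&
                read(hb[0], &beat, 1) > 0)
                continue;         /* still alive; keep watching */

            /* No heartbeat: log it, force a core dump for later
             * analysis (SIGABRT, assuming core dumps are enabled),
             * and fall through to start a replacement process. */
            fprintf(stderr, "worker %d hung or died; collecting core\n",
                    (int)pid);
            kill(pid, SIGABRT);
            waitpid(pid, NULL, 0);
            break;
        }
        close(hb[0]);
    }
}
```

The key property is that the parent only watches: a worker stuck spinning in a parse loop stops sending heartbeats, gets its core dumped for analysis, and is replaced without operator intervention.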
Strangely, we never saw a failure during any of the daytime testing. So, with the new watchdog in place, we went back to nighttime testing. (Patient and forgiving early customers are worth their weight in gold.) And sure enough, we started seeing processes hang at 100% CPU and get killed by the watchdog. At least this time, the failures caused only some connection loss, not a major outage.
After collecting the core dumps and several hundred gigabytes of packet capture (remember, at 10 Gb/s, you get about 1 GB of packet data every second), we began our analysis. The core dump showed that we were stuck in an infinite loop inside our HTTP parser. However, this code had successfully processed billions of HTTP requests at this point, so the question was, “why was this failing now, and why only at night?”
Now the hard work began. We built a tool that looked for HTTP requests that had no response within a certain time window, since that was a clue that a process might have hung and failed to respond. After several hours, the tool started spitting out packet sequences… and they all looked the same.
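In spirit, the scan the tool performed looked something like the sketch below. This is a simplified reconstruction of my own: it assumes the capture has already been reduced to a time-ordered stream of per-connection request/response events, and the record layout, flow ids, and 30-second deadline are all illustrative; the real tool chewed through hundreds of gigabytes of raw packets.

```c
/* Illustrative only: flag HTTP requests that never saw a response
 * within a deadline, given a time-ordered stream of events. */
#include <stdio.h>

#define WINDOW_SEC 30.0           /* assumed response deadline */
#define MAX_FLOWS  1024           /* illustrative limit on connection ids */

typedef enum { EV_REQUEST, EV_RESPONSE } ev_kind;

typedef struct {
    int     flow;                 /* client connection id */
    double  ts;                   /* seconds since capture start */
    ev_kind kind;
} event;

/* Prints every request that was never answered within WINDOW_SEC. */
static void find_unanswered(const event *ev, int n)
{
    double pending[MAX_FLOWS];    /* timestamp of outstanding request, or -1 */
    double end_ts = (n > 0) ? ev[n - 1].ts : 0.0;

    for (int f = 0; f < MAX_FLOWS; f++)
        pending[f] = -1.0;

    for (int i = 0; i < n; i++) {
        double *req = &pending[ev[i].flow];

        if (ev[i].kind == EV_RESPONSE) {
            *req = -1.0;          /* request was answered; clear it */
            continue;
        }
        /* A new request arrives while the old one is overdue: flag it. */
        if (*req >= 0.0 && ev[i].ts - *req > WINDOW_SEC)
            printf("flow %d: request at %.1fs never answered\n",
                   ev[i].flow, *req);
        *req = ev[i].ts;          /* track the newest request */
    }

    /* Requests still unanswered and overdue at the end of the capture. */
    for (int f = 0; f < MAX_FLOWS; f++)
        if (pending[f] >= 0.0 && end_ts - pending[f] > WINDOW_SEC)
            printf("flow %d: request at %.1fs never answered\n",
                   f, pending[f]);
}

int main(void)
{
    /* Tiny hand-made capture: flow 7's request never gets a reply. */
    event ev[] = {
        { 3,  0.0, EV_REQUEST  }, { 3,  0.2, EV_RESPONSE },
        { 7,  1.0, EV_REQUEST  },
        { 3, 40.0, EV_REQUEST  }, { 3, 40.3, EV_RESPONSE },
    };
    find_unanswered(ev, (int)(sizeof(ev) / sizeof(ev[0])));
    return 0;
}
```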
Eureka! We’d found the smoking gun, we hoped. It turned out that every time the load balancer failed to respond to an incoming HTTP request, the HTTP header was quite large and had dozens of extra NOKIA- headers. After replaying these headers through the parser, we were able to determine that the large headers were coming in over a slow network link, and were therefore very likely to have a buffer break in just the right place to trigger a parser bug that failed to advance the buffer pointer. This caused the parser to look at the same byte over and over again, resulting in 100% CPU usage.
This became our root cause hypothesis.
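To make the failure mode concrete, here is a compilable sketch of the kind of pointer-advance bug described above. It is not LineRate’s actual parser; the structure, the header names, and the exact trigger are my own guesses at what such a bug looks like.

```c
/* Illustrative only: a header-line scanner where a buffer that ends
 * in exactly the wrong place keeps the parse position from advancing. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *buf;   /* bytes received so far   */
    size_t      len;   /* number of bytes in buf  */
    size_t      pos;   /* current parse position  */
} parser_t;

/* Returns 1 when a full "...\r\n" header line has been consumed,
 * 0 when more data is needed. */
static int parse_header_line(parser_t *pr)
{
    while (pr->pos < pr->len) {
        char c = pr->buf[pr->pos];

        if (c == '\r') {
            if (pr->pos + 1 < pr->len && pr->buf[pr->pos + 1] == '\n') {
                pr->pos += 2;          /* consume "\r\n": line complete */
                return 1;
            }
            /* BUG: the buffer break landed right after '\r'.  We
             * neither consume the byte nor return for more data, so
             * the loop re-examines the same '\r' forever -- the
             * 100%-CPU infinite loop seen in the core dumps. */
            continue;
        }
        pr->pos++;                     /* ordinary header byte */
    }
    return 0;                          /* need more data */
}

int main(void)
{
    /* A complete line parses fine. */
    static const char ok[] = "X-Example-Header: value\r\n";
    parser_t good = { ok, sizeof(ok) - 1, 0 };
    printf("complete line parsed: %d\n", parse_header_line(&good));

    /* A replayed fragment whose buffer ends immediately after '\r' --
     * the "buffer break in just the right place".  Uncomment the call
     * to watch the process pin a CPU core, like the nighttime failures. */
    static const char split[] = "X-Example-Header: value\r";
    parser_t bad = { split, sizeof(split) - 1, 0 };
    (void)bad;
    /* parse_header_line(&bad); */
    return 0;
}
```

In this sketch the fix is equally small: when the buffer ends right after the '\r', return 0 so the caller can wait for more data instead of spinning on the same byte.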
This was all very promising, but we couldn’t explain why this was only happening in the overnight runs, never during the day. Remember, we had to explain all the data; otherwise we might not fix the problem correctly, and we’d end up with a very angry customer and a big loss in credibility.
Fortunately, after brainstorming with the customer ops team assigned to our project, we came up with a satisfying explanation. All the failing requests were coming from mobile handsets that were connected to Nokia base stations, which we could deduce from the HTTP header bloat they added. Moreover, the connection latency and inter-packet latency were very high for the failing cases.
In the U.S., there are very few Nokia base stations, if any. However, back in 2011, major carriers in Asia were using them. A quick check of the source IP addresses of the offending HTTP requests showed that they were indeed all from Asia, and India in particular.
And there it was. If all the offending traffic came from India, it would explain the high latency and variable packet inter-arrival times given the state of the Internet at the time (and today, for that matter). It would also explain why we only saw the offending requests at night — it was daytime in India.
A quick review of the timestamps on the offending requests from our packet analysis showed that they occurred during peak usage times in India and Southeast Asia, and the times lined up with the locations estimated from the IP addresses as well. We’d explained all the data. Science!
The good news is that if you are running a service where your company controls operations, you might be able to get away without explaining all the esoteric data. But you should be careful, because your fix might only mask the real issue and make it less likely to occur. Thus it will lurk, silently, waiting to cause a site-wide outage.
In our case, we had no choice. We only had a few opportunities to close customers in the early days of the company. Our software was absolutely mission critical and we didn’t get to control the deployment schedule.
Fortunately, a good scientific approach to debugging failures and good engineering practices led to a happy outcome. We even ended up with our product at a customer site where the customer had forgotten they had it installed in production. When they discovered the oversight, the system uptime was over a year with no monitoring, operator intervention, or outages. Oh, and that was a beta version of the product.
That’s the power of Science!
Originally published at Unbounded Systems.