The ability to quickly and effectively find and resolve bugs in new and established systems is one of the most valuable engineering skills that you can develop. Because this skill enables the rapid development and maintenance of high-quality engineered systems, it is foundational for many technology companies and highly sought after. Nevertheless, it is rarely evaluated in coding interviews, and it is often poorly understood and documented.
I have debugged and resolved many difficult problems in many different types of systems, including massively complex computer processors, multi-threaded servers and apps, and troubled individuals, families, and organizations. The optimal procedure for finding and fixing bugs is essentially the same across all domains. Surprisingly, many software engineers do not have a clear understanding of the process. I intend to address this deficit now. Here is my treatise on debugging.
When something is not working as expected, it’s easy to assume that everything is broken. Take the time to find the things that are working in the realm of the problem. This will help to circumscribe the problem and create a clear picture in your mind of its edges.
In the process of determining what is working, you will catalog a set of operations or behaviors that do not work. Spend time fleshing out this list. Be clear on precisely how the system is not working as expected. Even though these first steps might seem pointless because the problem is “obvious,” jumping into solving the problem too soon often leads to wasted time and effort, and to a partial or non-optimal solution.
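One lightweight way to keep this catalog honest is to write each observation down as a small executable check. The following is only a sketch: `catalog`, `parse_int`, and the check names are hypothetical, and the bug in `parse_int` is planted deliberately for illustration.

```python
from typing import Callable, Dict, List, Tuple

def catalog(checks: Dict[str, Callable[[], bool]]) -> Tuple[List[str], List[str]]:
    """Run each named check and split the names into (working, failing)."""
    working, failing = [], []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # an unexpected exception counts as "not working"
        (working if ok else failing).append(name)
    return working, failing

# Hypothetical system under test, with a deliberate bug: the sign is discarded.
def parse_int(text: str) -> int:
    return int(text.lstrip("-"))

checks = {
    "parses zero": lambda: parse_int("0") == 0,
    "parses positive": lambda: parse_int("42") == 42,
    "parses padded": lambda: parse_int(" 5 ") == 5,
    "parses negative": lambda: parse_int("-7") == -7,
}

working, failing = catalog(checks)
print("working:", working)
print("failing:", failing)
```

Here three checks land in the working list and only the negative-number check fails, which already draws an edge around the problem: the trouble is confined to signed input.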
Problem behavior that is discovered in a complex situation can be hard to reproduce or generalize, especially when there are non-deterministic or statistical effects. Any attempt to simplify the test case whilst retaining the problematic behavior is always time well spent.
For example, if the problematic behavior occurs when processing a very large dataset, you may want to try to reproduce the problem with increasingly smaller datasets. Of course, this approach will not work if the problem is caused by the size of the dataset itself. In that case, creating a simple, though still large, dataset might make more sense.
By incrementally paring down the situation where the problem arises, you not only increase your clarity about precisely what does and doesn’t work, but you also naturally start to construct hypotheses about what might be causing the problem.
Simple test cases are useful for communicating the bug precisely to others and for quickly testing whether changes affect the bug, and they may also become part of your anti-regression tests (see step 7). Since simple test cases can usually be run quickly, they also support hypothesis testing (see step 5).
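Shrinking a failing dataset can sometimes be automated. The sketch below is a greedy chunk-removal loop, loosely in the spirit of delta debugging rather than a full implementation of it. The function `shrink`, the predicate `fails`, and the sample data are all hypothetical; the predicate assumes a deterministic bug that appears whenever the data contains both a zero and a negative value.

```python
from typing import Callable, List, Sequence

def shrink(data: Sequence, fails: Callable[[List], bool]) -> List:
    """Greedily drop chunks of `data` for as long as `fails` still reports the bug."""
    current = list(data)
    assert fails(current), "the starting input must reproduce the problem"
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if fails(candidate):
                current = candidate  # the chunk was irrelevant, so drop it
            else:
                i += chunk           # the chunk is needed, so keep it and move on
        chunk //= 2                  # retry with finer-grained chunks
    return current

# Hypothetical failure condition standing in for "run the system and observe the bug".
def fails(rows: List[int]) -> bool:
    return 0 in rows and any(r < 0 for r in rows)

data = [3, 8, -2, 7, 0, 5, 9, 1]
minimal = shrink(data, fails)
print(minimal)  # prints [-2, 0]
```

The reduced input is small enough to reason about by hand, and it points directly at a hypothesis: the interaction between a negative value and a zero. For non-deterministic bugs, `fails` would need to run the system several times per candidate, which this sketch does not attempt.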
You might arrive at this point after minutes, hours, days, or even weeks of work. No matter how you got here, or how long it took, you will now have data, and you will have learned something about the way that the problem manifests. This knowledge enables you to form hypotheses about what might be causing the problem. These are theories about what process inside (or even outside) the system might be leading to the observed problematic behavior.
Taking each hypothesis in turn, dive into the system and find a sub-unit where you believe that something may be going wrong. Then run your small test case and look at the internal behavior before and after that sub-unit. If you find a problem before that sub-unit, then your hypothesis may have been wrong, and you at least know that you need to investigate further back towards the input of the system. If, on the other hand, the input to that part of the system seems correct, but the output seems incorrect, then you have support for your hypothesis, and you can zoom in more closely.
At this point, if you are not fully clear on what the bug is, then loop back to step 1 on this identified sub-unit.
It’s possible at this point to apply divide and conquer very naively: split the system arbitrarily into two halves, look for a problem in each half, and then recursively zoom in on the non-functional half. I don’t recommend this approach because it is usually very slow and cumbersome.
On the other hand, it’s possible to save a lot of time and effort by using hypothesis-driven divide and conquer, as described above. You still check whether behavior is as expected just before the sub-unit that is hypothesized to be broken, but, if things are functional there, you go straight to the output of that sub-unit. This enables very rapid zooming in on the bug.
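The hypothesis-driven check described above can be sketched for a system that happens to be a linear pipeline. Everything here is an assumption made for illustration: the `probe` helper, the three stages, and the idea that the small test case is simple enough that you know the expected output of every stage. The bug in `sort_ascending` is planted deliberately.

```python
from typing import Any, Callable, List, Tuple

# Each stage: (name, function, expected output for the small test case).
Stage = Tuple[str, Callable[[Any], Any], Any]

def probe(stages: List[Stage], test_input: Any, suspect: str) -> str:
    """Inspect the data just before and just after the suspected stage."""
    value = test_input
    input_ok = True  # assumption: the small test input itself is valid
    for name, func, expected in stages:
        output = func(value)
        output_ok = (output == expected)
        if name == suspect:
            if not input_ok:
                return "look upstream"  # the problem exists before this stage
            return "look downstream" if output_ok else "hypothesis supported"
        input_ok = output_ok
        value = output
    raise ValueError(f"unknown stage: {suspect!r}")

def parse(text: str) -> List[int]:
    return [int(x) for x in text.split(",")]

def sort_ascending(xs: List[int]) -> List[int]:
    return sorted(xs, reverse=True)  # deliberate bug: sorts descending

def smallest(xs: List[int]) -> int:
    return xs[0]

stages: List[Stage] = [
    ("parse", parse, [3, 1, 2]),
    ("sort", sort_ascending, [1, 2, 3]),
    ("smallest", smallest, 1),
]

print(probe(stages, "3,1,2", "smallest"))  # prints "look upstream"
print(probe(stages, "3,1,2", "sort"))      # prints "hypothesis supported"
```

A first hypothesis that blames `smallest` is refuted with one probe, because its input is already wrong. The second hypothesis, `sort`, has good input and bad output, so that is the sub-unit to zoom in on.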
Only proceed to the next step once you’re clear about what the bug is.
Sometimes bugs are caused by simple typos, or one-off misunderstandings, and these kinds of bugs can just be fixed in isolation. However, it’s much more common for bugs to be representative of a much larger class of problems.
After spending the time and effort to get to this step, you will usually have an incredibly clear perception of the relevant parts of the system and of the problem. You will be the world-class expert on this bug. For this reason, now is the time to leverage all of that knowledge. In a month, you will no longer have this clarity of perception with respect to this specific problem.
So spend time now to fully leverage your investment. Think about and document the overall class of bug, and determine if the system will likely manifest other expressions of the underlying issues, whether or not those particular expressions have been manifesting for users.
We don’t want to stick a band-aid on a malignant tumor and send the patient home.
Even if you don’t design systems using test-driven development, I recommend that you use test-driven bug fixing.
Make sure to write unit-level and/or system-level tests that exercise as much of the bug class as possible. Make sure that the tests that you expect to fail do in fact fail. The main reason that the bug exists at all is that there were no tests to catch it. This means that there was a hole in the test suite. I often say that if something is not tested then it’s broken. This is because you have to assume that it’s either broken now or that it will get broken at some point in the future, and then the first person to discover that it’s broken will be a customer.
Since you have a broken system right now, now is a perfect opportunity to develop tests and ensure that they fail. These opportunities don’t arise often, so grasp them while they are available.
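As a minimal sketch of test-driven bug fixing, the example below uses Python's standard `unittest` module to run the same tests against a broken and a fixed implementation. The functions `mean_broken` and `mean_fixed`, and the decision that the mean of an empty input should be `0.0`, are hypothetical choices made only for this illustration. The bug class being covered is "empty input", so the tests exercise more than the single case that was originally reported.

```python
import unittest

def mean_broken(values):
    return sum(values) / len(values)  # raises ZeroDivisionError on empty input

def mean_fixed(values):
    return sum(values) / len(values) if values else 0.0

def run_tests(mean) -> unittest.TestResult:
    """Run the same tests against whichever implementation is passed in."""
    class MeanTests(unittest.TestCase):
        def test_typical(self):
            self.assertEqual(mean([1, 2, 3]), 2)

        def test_empty_list(self):   # the originally reported case
            self.assertEqual(mean([]), 0.0)

        def test_empty_tuple(self):  # another member of the same bug class
            self.assertEqual(mean(()), 0.0)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTests)
    result = unittest.TestResult()
    suite.run(result)
    return result

before = run_tests(mean_broken)
after = run_tests(mean_fixed)
print("before the fix, all passing:", before.wasSuccessful())  # prints False
print("after the fix, all passing:", after.wasSuccessful())    # prints True
```

The important observation is the first one: the new tests are seen to fail against the broken code before any fix is attempted, which is the evidence that they actually detect the bug.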
I like to call regression tests anti-regression tests, because they prevent the product from regressing to an earlier, broken state. Run your full test suite before releasing new revisions of your product.
If you have been diligent, fixing the bug(s) will now be extremely easy; it’s just a formality.
This kind of bug fixing can be performed very calmly and confidently. The fix is wrapped in a high-quality software engineering process, a process that informs and tests it. In contrast, I have witnessed engineers operating at the opposite end of the scale, changing code almost randomly in the hope that it will fix the overall problem. That kind of approach is more likely to introduce new bugs than to fix the existing ones.
While fixing the bugs, you might notice other problems. In that case, loop back to an earlier step, such as step 6, for those problems too.
All of the new tests should now pass. If they don’t pass then you’ll need to loop back to an earlier step and resolve the issue.
At this point, it should be possible to run the simple test cases that you developed in step 3, and they should be working properly. If not, then loop back to an earlier step to resolve the issue.
You should now be able to perform the behaviors originally reported to be problematic, and you should no longer see an issue. If you do see an issue, then return to an earlier step to resolve it.
You have just performed an extremely high-quality set of engineering maneuvers. This is the stuff that legends are made of. It’s possible that you are the only person who is aware of your heroic actions. Write them down so that they can become a part of engineering lore. Document the code, document the test plan, document the test suite, write a wiki page or a blog post. Do something to capture the wisdom that you have developed and to make it available for others. Your documentation will also educate and mentor others. You will be setting a good example for other engineers, an example of both how to use resources effectively and efficiently, and also of how to execute challenging engineering work in a way that is deeply satisfying and nourishing to the soul.
During the time that you have been focusing your attention on resolving this particular issue, you may have noticed other potential classes of bug, and possibly other manifest classes of bug. File bug reports for issues that are manifest in dysfunctional behavior, or that you’re certain are lurking undetected. For other classes of bug that may not be present, but that are also not currently tested for, take whatever action is necessary to direct testing effort towards them. For example, you might update a document of test-plan ideas.
Release your fix, either internally or externally, and make sure that everyone knows what you did. Summarize the problem and the solution succinctly, and include links to the documentation that you created.
You just did some awesome, high-quality engineering. Pat yourself on the back and head off to do something else that’s outstanding.