A short, motivational guide to debugging your horrible production issue.
You’ve been tasked with finding a bug that has reared its ugly head in a production system.
You’ve been staring at the keyboard for hours now with growing doubts in your mind about your suitability for a software engineering career; you have drawn a complete blank.
You feel like a fraud, an impostor.
Relax. You aren’t the impostor here — the bug is.
For years now it’s been happily impersonating a functioning part of the system. But now you’re on the trail of this fake, this fraud, this impostor, and you won’t give in.
So where do you start?
Reproduce the problem.
At an absolute minimum, you need to be able to take the problem and confidently reproduce it.
Again and again…
Until you can do this, you can’t test whether a change you’ve introduced has had any impact, you can’t expose vital clues from the system, and you can’t prove to anyone else that the problem has been resolved.
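One way to make "again and again" concrete is a small script that runs your reproduction steps repeatedly and counts how often they fail. Here is a minimal sketch in Python. The names `failure_rate` and `make_flaky_scenario` are made up for illustration, and the fake scenario simply fails on every third call as a stand-in for whatever really triggers your bug.

```python
from typing import Callable


def failure_rate(scenario: Callable[[], None], runs: int = 20) -> float:
    """Run the reproduction steps `runs` times and return the fraction that failed."""
    failures = 0
    for _ in range(runs):
        try:
            scenario()
        except Exception:
            failures += 1
    return failures / runs


def make_flaky_scenario(fail_every: int = 3) -> Callable[[], None]:
    """Hypothetical stand-in for real reproduction steps: fails on every Nth call."""
    calls = 0

    def scenario() -> None:
        nonlocal calls
        calls += 1
        if calls % fail_every == 0:
            raise RuntimeError("simulated production failure")

    return scenario


if __name__ == "__main__":
    rate = failure_rate(make_flaky_scenario(fail_every=3), runs=30)
    print(f"failure rate: {rate:.0%}")
```

Once you have a number like this, you can run the same script after every change and see whether the rate actually moved.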
Reproduce the problem in an environment you can control.
So you can reproduce the problem? That’s a good start. But if you can only reproduce it in production, you’re going to be constrained in the approaches you can use to narrow down its source.
Hopefully you’re in a situation where your organisation has a staging, UAT, pre-production or test environment that somewhat resembles production.
Use it to your advantage and try to reproduce your problem there. Doing so opens up approaches that give you more visibility and control than you have in production, or that would at least take far longer to apply there.
Here are a few examples:
- Increase logging levels of applications
- Remote debugging of applications
- Ability to release modifications of application code outside of a regular quality gate to trial changes quickly*
- Suspend or terminate adjacent processes to determine flow of events
* — I think this should only be exercised in times of urgent need and you should always clean up any mess.
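To illustrate the first item in that list, here is a rough sketch of switching log verbosity without touching application code. It assumes a Python application; the environment variable name `APP_LOG_LEVEL` and the helper `configure_logging` are invented for this example, so adapt them to whatever your system already uses.

```python
import logging
import os

# Level names we are willing to accept from the environment.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def configure_logging(default_level: str = "WARNING") -> int:
    """Set the root logger level from APP_LOG_LEVEL (a hypothetical variable name)."""
    name = os.environ.get("APP_LOG_LEVEL", default_level).upper()
    level = LEVELS.get(name, logging.WARNING)  # fall back on unrecognised values
    logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    logging.getLogger().setLevel(level)
    return level


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("orders").debug("only visible when APP_LOG_LEVEL=DEBUG")
```

In a staging environment you could then start the process with `APP_LOG_LEVEL=DEBUG` and go back to the default once you have what you need.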
Understand the system.
Whenever you find a problem you can’t immediately solve in a system you think you know, you need to face facts: you don’t know the system as well as you thought you did.
Take a step back and try to draw out the system at a high level. Gradually and continuously focus inwards on the parts that are most likely to be the culprit in the particular scenario you’re working on.
Simplify the problem.
Take the problem you’re working on and instead of focusing on tackling the problem directly, try to reduce the complexity and the number of moving parts to keep track of in your mind.
We are fragile, meaty beings with only a limited ability to store and process information. Make the problem more digestible.
In a production system, eliminate the components and behaviour that can’t possibly be contributing to the dysfunction with simple irrefutable tests.
You’ll be left with a much smaller and simpler problem to work on and comprehend. If you continue doing this, you will eventually find the solution.
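If your candidates can be put in an order (a list of recent changes, or components you can switch on one by one), you can do the elimination mechanically by halving the list each time. The Python sketch below is only an illustration under two assumptions: there is a single culprit, and the problem appears as soon as the culprit is included. The function `find_culprit` and the component names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_culprit(candidates: Sequence[T],
                 fails_with: Callable[[Sequence[T]], bool]) -> T:
    """Return the first candidate whose inclusion makes the check fail.

    Assumes fails_with(candidates[:k]) is False for small k and stays True
    once the culprit is part of the prefix.
    """
    if not candidates or not fails_with(candidates):
        raise ValueError("the full set does not reproduce the problem")
    lo, hi = 0, len(candidates)  # invariant: prefix [:lo] passes, prefix [:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_with(candidates[:mid]):
            hi = mid
        else:
            lo = mid
    return candidates[hi - 1]


if __name__ == "__main__":
    # Hypothetical components, enabled in this order in a test environment.
    components = ["cache", "auth-proxy", "rate-limiter", "report-job", "mailer"]
    # Stand-in for "enable these components and run the reproduction steps".
    reproduces = lambda enabled: "rate-limiter" in enabled
    print(find_culprit(components, reproduces))
```

Each call to `fails_with` is one of your simple, irrefutable tests, and halving the list means you need only a handful of them even for a long list of suspects.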
Bugs DO NOT go away on their own!
…Oh and please don’t fall foul of this common logical fallacy:
I could reproduce it 20 minutes ago and I haven’t changed anything, but it seems to have gone …maybe it’s gone forever!
If someone has spotted a bug and you can no longer reproduce it even though you haven’t made any changes, it will happen again.
If you make a change in order to validate a theory or introduce a potential fix and the change has no effect, you should immediately roll it back.
Don’t add one change on top of an ineffective change. Something you introduce may have no effect at first, but if you keep adding changes you end up with a more complicated, messier problem to solve.
Worse still, you could wrongly attribute some change in behaviour to an unrelated change you’ve introduced. Don’t do it! Keep your changes isolated and test every change.
Rubber duck debugging.
Though it can sometimes feel otherwise, there is absolutely no shame in asking for help with a problem that is eluding you.
Practise Rubber Duck Debugging: the act of explaining your current problem to another person, or even an inanimate object, to help you gather your thoughts and regain focus.
Sometimes the reason you haven’t solved the problem yet is precisely because you’ve been focusing on the problem too long and you’ve fallen down a rabbit hole.
Document everything you’ve already tried.
In a complex scenario with high stakes, it’s very easy to lose track of what you’ve done already to solve a problem. You could end up going back and treading old ground without even knowing it. Simply document everything you’ve done so far, and keep documenting as you go along.
I’ve tried everything and I’m still stuck!
You really haven’t. There is going to be something that can be done to highlight what the cause of your issue is.
You might be too close to the problem; you might never be able to think of the solution yourself. But you must absolutely accept that there is something that can be done to highlight what the problem is. You just haven’t found it yet, and your job is to find it.
Don’t give up.
Learn new debugging techniques.
Debugging isn’t just setting breakpoints and hoping for the best. There are lots of techniques to learn. One such technique worthy of attention (though there are many others) is the use of Conditional Breakpoints:
As with ordinary breakpoints, a conditional breakpoint halts the execution of a program, but only when a condition you specify is met.
This can save a lot of time in tracking down your elusive bug, because when the program halts you know it has reached a state you’ve decided is important.
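Most debuggers and IDEs let you attach a condition to a breakpoint through their own interface. As a rough illustration, here is the same idea written directly into Python code using the built-in `breakpoint()` function. The function `apply_discounts` and its data are invented for the example; the suspicious condition here is a total that goes negative.

```python
def apply_discounts(orders):
    """Return the total for each order after its discount is applied."""
    totals = []
    for order in orders:
        total = order["price"] * (1 - order["discount"])
        # Conditional breakpoint: halt only in the case we consider suspicious.
        if total < 0:
            breakpoint()  # opens the debugger; inspect `order` and `total` here
        totals.append(total)
    return totals


if __name__ == "__main__":
    sample = [
        {"price": 10.0, "discount": 0.5},
        {"price": 10.0, "discount": 1.5},  # hypothetical bad data
    ]
    print(apply_discounts(sample))
```

If you would rather not edit the code, Python's `pdb` debugger can do the same from its prompt: its `break` command accepts a condition after a comma, in the form `break <line>, <condition>`. Setting the environment variable `PYTHONBREAKPOINT=0` turns any `breakpoint()` calls you left behind into no-ops.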
There isn’t always a clear way to solve the problem.
Sometimes you have to get creative…
Above all else, take regular breaks.
Solving a tricky production bug can be tough, even emotionally draining.
Go for a walk, grab a drink and get away from your desk. Do it now.
Sometimes it can make the world of difference to have your subconscious mind take the wheel and do some offline processing, while you distract yourself with something else.
And if you’re lucky, you might just get to that magical moment when…
You’ve Solved It!…
But it isn’t over.
Are you sure you’ve solved it?
Test and re-test your fix. Revert to the original configuration and run your reproduction steps. Confirm the bug exists. Apply your changes and run the reproduction steps again. Confirm the bug has gone.
Do it again.
…are you really really sure now?
After all your pain and suffering, it’s easy to forget very simple confirmatory steps. Don’t cause yourself more heartache — be confident in your solution before proclaiming the issue is resolved.
…Oh and you should definitely write some tests now.
A bug being found in production can mean only one thing about your application’s test suite: it isn’t good enough because the bug wasn’t caught early enough.
Use the opportunity now to write some tests that exercise that area of code, so no-one else has to go through the suffering you’ve just experienced.
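A regression test can be as small as the input that caused the failure plus the result you now expect. The sketch below uses Python's standard `unittest` module. The function `parse_quantity` and the bug it describes (a crash on quantities written with a thousands separator) are invented purely to show the shape of such a test.

```python
import unittest


def parse_quantity(raw: str) -> int:
    """Hypothetical function under test; the imagined fix was to drop thousands separators."""
    return int(raw.replace(",", ""))


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_thousands_separator_is_accepted(self):
        # Hypothetical input that triggered the production bug.
        self.assertEqual(parse_quantity("1,000"), 1000)

    def test_plain_number_still_works(self):
        self.assertEqual(parse_quantity("12"), 12)

    def test_garbage_is_still_rejected(self):
        with self.assertRaises(ValueError):
            parse_quantity("twelve")


if __name__ == "__main__":
    unittest.main()
```

It’s worth running the new test against the old code first. If it doesn’t fail there, it isn’t really testing the bug you just fixed.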
…at least until a different bug comes up anyway!
Hope you’ve enjoyed this and it’s been useful. If you have any feedback I’d be grateful if you could share it.
Thanks for reading 😁!