Nightmare at 20,000 Feet is one of the most iconic Twilight Zone episodes of all time. It tells the story of Bob Wilson, a salesman with a nervous condition. Bob peers out through the window of an airplane. He is moderately surprised to see a gremlin milling about on the wing.
Bob makes increasingly frantic attempts to show the gremlin to other passengers. It’s no use. When anyone else looks through the window, the gremlin disappears. To make matters worse, it deviously begins to dismantle the plane engine, putting everyone in mortal danger.
Sometimes, working as a software engineer feels like flying a plane with hundreds of wings. Each user is constantly telling you about a different gremlin that they can see through their window. But when you look for yourself, you see nothing out of the ordinary.
We need to fix our bugs, even if they only happen to certain people under certain circumstances. Even if we don’t know what those circumstances are. We take full responsibility for the systems that we create.
Fixing these unreproducible bugs is difficult, but often achievable. Here’s your survival guide for keeping the gremlins off your wings.
As an engineer, it’s crucial that you spend your time wisely at work. Many engineering teams apply in-depth methodologies to manage the new feature development work that they do. Team leads often know immediately when a given project is taking longer than expected. And we don’t expect engineers to power through the whole backlog in a single sprint.
But these same teams often fail to apply these same principles to bug work. Many engineers come to work without clear expectations. They wonder, “Should I be fixing my most important bug, or working on my most important user story?” The result is that engineers often fail to meet expectations around fixing bugs in a timely manner. Many times, these expectations are unrealistic to begin with.
And the engineers end up feeling guilty about the bug that’s been sitting in their inbox for a month…or six.
When a bug can’t be reproduced, you can’t expect to know how long it will take to fix. But you can estimate how long a given investigation will last.
For example, you might hypothesize that browser compatibility is at fault for your bug. Opening your application in Firefox to check this should only take a couple of minutes.
Writing down your hypotheses explicitly provides visibility into your process. It makes it easy for others to see that you’re actually working on a bug. At Asana, we use subtasks to track hypotheses on bugs as we investigate.
Sometimes hypotheses take more than just local debugging to figure out. Another strong technique is to use logging and assertions to get more clarity on your hypotheses.
For example, your bug might be caused by a value unexpectedly being null. In that case, you could add logging around places where the value is manipulated to get more clarity.
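Here is a minimal sketch of that idea in Python. It assumes a hypothetical `check_not_none` helper and a made-up `rename_project` call site; neither comes from a real codebase. The helper behaves like an assertion that logs instead of raising, so it can sit in production until the bug recurs.

```python
import logging
import traceback

logger = logging.getLogger("bug_hunt")


def check_not_none(value, name, **context):
    """Log a warning with a short stack trace if `value` is None.

    Unlike `assert`, this never raises, so it can ship to production
    while you wait for the bug to happen again. The value is returned
    unchanged so the helper can wrap an existing expression.
    """
    if value is None:
        stack = "".join(traceback.format_stack(limit=5))
        logger.warning(
            "%s was unexpectedly None; context=%r\n%s", name, context, stack
        )
    return value


def rename_project(project, new_name):
    # Hypothetical call site: wrap each place the suspect value is read.
    owner = check_not_none(
        project.get("owner"), "project.owner", project_id=project.get("id")
    )
    project["name"] = new_name
    return owner
```

The extra keyword arguments are there so each log line carries enough context (an ID, a code path) to tell you which user and which situation triggered it.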
Using logging this way can be effective, but it’s also a very slow iteration cycle. To counteract that, always try to add logging for multiple hypotheses at once.
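One way to keep several hypotheses apart in the same release is to tag every log line with the hypothesis it is testing. The sketch below is only an illustration: the `log_hypothesis` helper, the hypothesis IDs, and the `save_task` function are all invented for this example.

```python
import logging

logger = logging.getLogger("bug_hunt")


def log_hypothesis(hypothesis_id, message, **data):
    """Emit one log line tagged with the hypothesis it is meant to test."""
    fields = " ".join(f"{key}={value!r}" for key, value in sorted(data.items()))
    line = f"[{hypothesis_id}] {message} {fields}".rstrip()
    logger.info(line)
    return line


def save_task(task, user_agent, cache_hit):
    # Three hypotheses instrumented in a single deploy (all hypothetical).
    log_hypothesis("H1-null-assignee", "assignee at save", assignee=task.get("assignee"))
    log_hypothesis("H2-browser", "client at save", user_agent=user_agent)
    log_hypothesis("H3-stale-cache", "cache state at save", cache_hit=cache_hit)
    return task
```

When the bug next occurs, you can search the logs for each tag and rule hypotheses in or out together, instead of waiting one release cycle per idea.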
We think of engineering as a solitary act. Part of this is defined by pop culture depictions of engineers as loners. I think that this is also influenced by people’s early experiences with engineering. Most people learn to code by working on solo projects. Many engineers’ early experiences of learning and success come from working alone.
But working on a large project with several team members is a completely different world. It’s one that you can’t navigate on your own. To understand how a system is misbehaving, you ultimately need to understand the behavior of the people working on it.
We all want to be the hero who solves the bug all on their own. But that tendency backfires when we choose to sink hours into investigating complex bugs that aren’t going anywhere.
As you spend time working on a bug that you can’t reproduce, think about how you can gradually escalate the bug’s visibility as you work.
For example, you might start by looking at something for 1 hour, after which time you include your tech lead on the task. Then, after you’ve worked unsuccessfully on the bug for 4 hours, try to loop in 3 engineers who might know what’s going on.
What you’re looking for here is not for someone to take over responsibility for the bug. It’s crucial that you maintain clear responsibility if the bug is assigned to you. Instead, you’re looking for people to remember important information that might be relevant to your search.
The most obvious form of this is “Oh, I broke this last week.” Another form would be “This reminds me of another problem I fixed last year…have you tried checking whether ad blockers are causing it?”
The biggest challenge with escalating visibility of a bug is your own insecurity about not being able to fix the bug yourself. If you’re really uncertain, I suggest that you ask your manager. They’ll probably tell you that it’s your job to find whoever can help.
Take full advantage of the written context that your teammates are already creating.
Every code change that your team produces probably already has a commit message describing what it does. If a bug has started recently, consider scanning through all of the code changes that your team has made in the last few days. They might point you to areas of active development in the codebase that you don’t know about.
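If your team uses git, a short script can pull up those recent commit messages for you. This is a sketch under a few assumptions: `git` is on your path, the three-day window is arbitrary, and the function names are made up for this example.

```python
import subprocess


def parse_git_log(output):
    """Turn tab-separated `git log` output into (hash, author, subject) tuples."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        commit_hash, author, subject = line.split("\t", 2)
        commits.append((commit_hash, author, subject))
    return commits


def recent_commits(repo_dir=".", since="3 days ago", path=None):
    """List commits made since `since`, optionally limited to one path."""
    command = [
        "git", "log",
        f"--since={since}",
        # Short hash, author name, and subject, separated by tabs.
        "--pretty=format:%h%x09%an%x09%s",
    ]
    if path:
        command += ["--", path]
    result = subprocess.run(
        command, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return parse_git_log(result.stdout)


def matching(commits, keyword):
    """Keep commits whose subject mentions `keyword` (case-insensitive)."""
    return [commit for commit in commits if keyword.lower() in commit[2].lower()]
```

For instance, `matching(recent_commits(path="src/tasks"), "assignee")` would list recent commits under a hypothetical `src/tasks` directory whose message mentions "assignee", along with the author you might want to ask.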
There are other sources of context, too. Check these things out to get a sense of what’s going on in the codebase.
The danger here, of course, is that these sources of information are totally bottomless. Don’t expect yourself to be familiar with everything that’s going on. If 15 minutes of reading through recent changes doesn’t yield anything, it’s unlikely that another hour will be any more helpful.
Bugs go into a bug tracker. A bug tracker counts bugs. We want to have zero bugs, right?
I’m going to have to ask you to give up on that dream. There’s basically no such thing as a bug-free system. Not all bugs are important enough to get fixed.
And yes, this includes bugs that are marked as “important.”
Don’t focus on fixing every last bug. Develop a clear and communicative relationship with the people who are bringing bugs to you. That’s how you can find creative solutions to technical problems, and understand what users truly need.
“At this company, we eat our own dogfood!” It’s awesome to work somewhere that you can use the software you’re creating. Dogfooding has incredible benefits for quality assurance, motivation, and user empathy. It also is an endless source of meaningless internal bug reports.
Take internal reports with a grain of salt. Internal reports are often the first way that we discover bugs, and when they’re accurate they can catch bugs before users experience them at all.
We’re naturally driven to help people that we know and interact with directly. So internal reports, even from a single person, often feel like the most urgent and important kinds of bugs to fix.
But remember how weird internal users are. Their usage of the app is probably not typical. They’re probably using a different release of the software than what’s on production. And they probably have a whole set of feature flags resulting in them hitting code paths which are vastly different from what happens for users.
So when an internal user reports a bug, and it can’t be reproduced, sometimes it’s wise to just wait to see if real users hit it. And, of course, always be ready to roll production back.
And remember that external users are weird, too. I think it’s acceptable to wait to see whether the bugs happen to at least three real users before investigating in-depth. (Of course, this very much depends on how severe the bug is.)
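If your bug reports are available as data, that rule of thumb is easy to check automatically. The sketch below assumes reports arrive as `(bug_id, reporter_email)` pairs and that internal users can be recognized by email domain; both are assumptions for illustration, and the threshold should drop for severe bugs.

```python
from collections import defaultdict


def bugs_worth_investigating(reports, internal_domain, threshold=3):
    """Return bug IDs reported by at least `threshold` distinct external users."""
    users_by_bug = defaultdict(set)
    for bug_id, reporter_email in reports:
        email = reporter_email.lower()
        if email.endswith("@" + internal_domain):
            continue  # internal (dogfooding) reports are not counted here
        users_by_bug[bug_id].add(email)
    return sorted(
        bug_id for bug_id, users in users_by_bug.items() if len(users) >= threshold
    )
```

Counting distinct reporters, rather than raw reports, keeps one persistent user from pushing a bug over the threshold on their own.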
Many users have strange software configurations, invasive browser extensions, and unusual network conditions. These often create problems that are impossible to reproduce. And often these problems aren’t worth fixing. Of course, I’m assuming that you do have other important things to do.
Working on these kinds of bugs can be annoying and taxing for all parties involved. This includes customer support agents and your manager. Don’t make the mistake of working hard on an intractable problem without telling anyone about it.
We only like to deliver good news, as in:
But remember that it’s also good news to say:
This communication pattern is an essential tool for employees on the business side of things. They need to communicate frequently and clearly with customers. Don’t block them by being silent. If you spend any time working on a task, don’t walk away from it without writing some communication about it first.
It’s also important to build up communicative trust this way so that you can close or delay bug reports openly. If something is scheduled to be done by a certain day, and that deadline isn’t happening, don’t let it pass in silence. You will appear far more trustworthy if you proactively comment on the bug explaining why you’re moving the deadline. It also creates an opportunity for you to receive feedback if your decision to delay is wrong.
Working with unreproducible bugs sucks. But it doesn’t have to. Approach these problems with curiosity, collaboration, and lateral thinking. You might find that you actually enjoy the challenge. There’s so much to learn about the systems that we use, and so much to challenge and surprise us along the way. Focus on that, and you’ll be on track for becoming the best bug-basher on your team.
Special thanks to Justin Churchill, Bella Kazwell, Steven Rybicki and Mark Yao for their help with this post.