Understanding the root cause can be the first step to finding the ultimate solution
Bugs are not restricted to software. They can show up in any type of system as an error, flaw, failure or fault of some sort, just like regular software bugs. I would go as far as to say that any system an observer can interpret as having a purpose can contain some sort of bug that will eventually make that system deviate from that purpose.
We can think of a few examples of bugs in real life:
- Cars are supposed to start, but the engine can fail to turn on
- The train is supposed to arrive on time, but it can be delayed
- The traffic system is supposed to organize transit, but accidents can still happen
- Teams are supposed to deliver value to the company but they can fail to do so
Bugs are not restricted to code. They exist anywhere something bad can happen (and will happen, according to Murphy's Law). Every time we want to find a solution for a bug, we need to think beyond the code itself.
Any system that an observer can interpret as having a purpose can contain a "bug" that will eventually make that system deviate from that purpose
Even though there are no silver bullets to prevent every bug from happening, we can still look for strategies for what to do after one is discovered.
Before a bug is discovered we have no information about it. For all practical purposes, it doesn't exist. After we see the bug, however, we are in a privileged position: we have some information that can help us come up with more efficient actions to prevent that specific type of bug from happening again.
We need to understand the fundamental aspects of a bug instead of just glancing over it. It is essential to make the most of the information we get and come up with efficient actions to tackle it. However, we need to be extra careful about how we do that, because there are traps along the way.
There's a phenomenon called "illusory correlation". It happens when an individual perceives a relationship between events when there's no relationship at all. In statistics, the phrase "correlation does not imply causation" reminds us that just because events are correlated, that doesn't mean one caused the other. We can make an analogy with the train that was delayed and the car that crashed: just because they happened very close to one another, that doesn't mean the crash caused the delay. Unless there's additional evidence that points to that conclusion, we should not assume the delay has anything to do with the crash.
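To make "correlation does not imply causation" a bit more concrete, here is a minimal Python sketch. The daily numbers for train delays and car crashes are made up for illustration, and `pearson` is just a small helper written for this example. The two series move together almost perfectly, yet nothing in the number tells us whether one caused the other, whether both share a hidden cause, or whether it is a coincidence.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical observations over seven days (invented for this example).
train_delay_minutes = [2, 8, 4, 12, 3, 6, 9]
car_crashes = [1, 3, 2, 5, 1, 3, 4]

r = pearson(train_delay_minutes, car_crashes)

# A coefficient close to 1 only says the two series rise and fall together.
# It carries no information about which event, if any, caused the other.
print(f"correlation between delays and crashes: {r:.2f}")
```

A high coefficient is a reason to go looking for evidence, not the evidence itself.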
There's also a difference between Proximate and Ultimate Causation when dealing with this category of problems. The proximate cause is the most obvious one and the ultimate cause is not so obvious but can hold the real reason why something happened (in a high-level form).
Let's take an example: Jane spotted a behavior in production that is different from what the user expects the system to do. It's obviously a bug and we need a fix. The first step should be to ask: why did the bug happen?
- The first cause can be: because an unexpected change happened in production, which is obvious.
- The second cause can be: because the developer made a change that broke it, which is a little stronger than the previous one.
- The third one can be: because the system was built, which strays too far from what we really want to know but is still a valid cause (no system, no bug).
In this case, the most reasonable ultimate cause seems to be the second one: because the developer made a change that broke it.
The proximate cause is the most obvious one and the ultimate cause is not that obvious but can hold the real reason why something happened, in a high-level form.
It's not enough to spot the ultimate cause, though. Just because a developer made the change, that doesn't mean the developer is the ultimate cause of the bug. We need to channel our inner child and ask "why" as many times as we can to dig deeper into what really happened.
- Why did the developer make the wrong change? Because there was no way to predict the side-effects.
- Why was there no way to predict the side-effects? Because the code quality of the system is low.
- Why is the code quality of the system low? Because we haven't applied enough effort to make the code quality high.
- Why haven't we applied enough effort to make the code quality high? Because the team is expected to meet insane deadlines and that makes it impossible for us to improve the code.
This series of questions has shown that there's more to the bug than it initially seems, and the last question probably unveiled its root cause.
Surprisingly, in the example above, the conclusion has nothing to do with the code.
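The chain of whys from Jane's example can also be written down as data, which makes it easy to keep next to a bug report. The Python sketch below is only one possible way to do it: `Why` and `root_cause` are hypothetical names made up for this illustration, and treating the last answer as the root cause is an assumption that holds only as long as nobody can think of a further "why" to ask.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Why:
    """One question in the chain and the answer the team gave to it."""
    question: str
    answer: str


def root_cause(chain: List[Why]) -> str:
    """Return the answer to the last "why" asked, the deepest cause found so far."""
    if not chain:
        raise ValueError("ask at least one why before looking for a root cause")
    return chain[-1].answer


# The questions and answers from Jane's bug, in the order they were asked.
chain = [
    Why("Why did the developer make the wrong change?",
        "There was no way to predict the side-effects."),
    Why("Why was there no way to predict the side-effects?",
        "The code quality of the system is low."),
    Why("Why is the code quality of the system low?",
        "We haven't applied enough effort to make the code quality high."),
    Why("Why haven't we applied enough effort to make the code quality high?",
        "The team is expected to meet insane deadlines."),
]

# Print the chain with growing indentation to show how deep we went.
for depth, why in enumerate(chain):
    print("  " * depth + why.question)
    print("  " * depth + "-> " + why.answer)

print("Root cause so far:", root_cause(chain))
```

Keeping the whole chain, and not just the last answer, lets the team see how they got from the symptom to the root cause and decide at which level they want to act.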
We can call the high-level ultimate cause of a bug the "immediate cause". The idea is to dig as deep as possible into the bug until we find the "root cause", which we can act upon more efficiently.
The most naive way to "fix" Jane's bug without trying to find the root cause is to ask the developer to change the code to what the user expects and call it a "fix". The problem is that the root cause won't be uncovered and more problems will keep happening. Once we have more information, however, we can act directly on the root cause. Only once we deal with that can we consider the bug fixed. Otherwise, it's just a "workaround".
I know this sounds harsh. In our profession we do a lot of those "naive fixes" and call it a day, as if it were the right thing to do. The point of this mindset, though, is mostly to raise awareness. It's ok to go back and do the "workaround", as long as the team understands that it's not really fixing anything. At least now the team will be aware of the real reason why the bug happened, and going the other way will be a conscious decision if they choose to do so.
Dealing with the immediate cause without understanding the root cause of a bug is not a "fix", it's just a "workaround"
We tend to focus on the immediate cause of a bug, and that can lead us to take an action closer to the high-level proximate cause. What we should do is look deeper and analyze all the potential factors that could have caused that bug until we reach the root of the problem. Just because the cause is obvious, that doesn't mean that investing just enough effort to fix it will be a viable long-term solution.
According to Alexander Magoun, a historian from IEEE, "bugs are not only for today’s computers or software; for more than 100 years, they have represented the challenges of an imperfect world that engineers work to overcome".
I believe that in order to deal with this imperfect world and find the best possible solution for all bugs, we need to focus on the root cause and come up with more efficient actions to prevent those bugs from happening. Otherwise, we will keep struggling over and over again, building workarounds for problems that should simply be fixed.
Next time you see a bug, you may realize that there’s more underneath than it seems, with a very efficient solution eager to be discovered.
Let's try and find them!