Relax, nobody lost a forest.
What they did lose were a multitude of essential data logs.
The subject company here will be referred to as TheCompany. The developer that found out about the issue is Bob.
In this article I want to discuss human errors in software development and preventive measures for those.
TheCompany’s process for deploying codebase changes is summed up like this:
The checklist is compared to a before-takeoff checklist for pilots. As mentioned in the linked article:
The […] practical test standards clearly state that pilots must use appropriate written checklists
You wouldn’t argue about the need to do a check of all systems before taking off in an airplane, would you? Would you board a plane if an airport staff member would say that the pilots skipped some checks to be able to depart faster?
I know I wouldn’t.
Yet the accident record shows that some pilots don’t [follow the checklist]. Such actions can yield dire results.
TheCompany is not an airline and the cost of their mistakes most likely isn’t human lives. However, they do impact the company’s bottom line, which in the worst case scenario causes people to be let go.
Altogether, in TheCompany there are 3 checklists — 1) Pre-production, 2) During-production and 3) Post-production
There’s clearly a lot of things that could go wrong and to put the blame on a specific reason is to not see the whole picture. So let’s unpack some of the aforementioned points,
Here’s why the problem occurred and led to the loss of a multitude of logs.
7-day-timeframe log check should be enforced
Developers usually only check last 1 to 4 hours of logs during the release. This means that if any previous production release that is older than 4 hours had an anomaly in the logs, it would again go unnoticed. This loop theoretically can go on forever.
In fact, Bob has already caught multiple bugs exactly from doing a 7-day-timeframe log check. So far, nobody — neither the management, nor the developers — has considered a 7-day-timeframe a necessity.
Time is also not an issue here because a developer can quickly switch the logs to show 7 days, 1 day, 1 hour periods.
Currently there is no proper daily monitoring of the logs. They are done in scope of the checklists but that’s about it. A solution could be to implement a mandatory check of logs at the end of the day.
When I was working in Evolution Gaming in Latvia, my team had rotating roles where a person would check Sentry errors logs, notify the team about pending Pull Request reviews, conduct the daily standup and more. There were daily roles and roles assigned to a person for the whole sprint (2 weeks).
To ensure all of this is followed properly and without boredom, we also added a gaming mechanism to these processes e.g. points for each pull request review.
We can all talk about how reviews should be done but the reality is, people often either forget, are overwhelmed with other tasks, simply don’t want to do a code review or a myriad of other reasons. Incentives are a good way to manage some of these friction points.
Same approach could be adapted by TheCompany regarding monitoring. Unless you’re a startup that just raised $100M in Series A, you definitely have time to invest in these worthwhile activities that benefit everybody in the long run.
2 Approvers or More
Every team that I have been a part of that had an effective code review process had a minimum of 2 approvals per PR. This has proven so beneficial that I even try to enforce this approach during my freelancing contracts.
Main reason why one approver is often not enough is the same as why you need at least 2 pilots to fly a plane. If one forgets to flip a switch on the panel, the other pilot is there to fix the mistake.
Regardless whether its unit or integration tests. They are your safety net.
The hack part of code was not covered by integration tests. In fact, TheCompany has no integration tests for Bobs team and the fight to introduce them is surprisingly tough. Not because of technical difficulties, but because of lack of coherence, motivation and (maybe) ignorance.
Subjectively, the #1 reason most developers don’t do tests is because “why bother if it already works and I can hack something together so much faster without the friction of having to think about tests”.
The #2 reason is short deadlines.
If we talk about unit tests, it’s sometimes hard for people to see the benefits of them because it’s difficult to quantify how many bugs they actually prevent.
Thing is, unit tests prevent bugs during development. Most likely your project will have tests running before pushing to production. Thus, even if 1 test fails, the merge is aborted. This gives you a chance to fix the issue and push again.
Even better, have tests run in watch mode. I’ve seen developers program without tests running in the background only to find themselves spending way too much time trying to backtrack in their changes.
Even in watch mode, you can be caught up in the work and go too far. This can be prevented by adapting a refactoring approach. In Martin Fowlers book about Refactoring he suggests to refactor meticulously while staying in the green i.e. if a test fails, revert your last change and implement it in a way that doesn’t break the tests. The approach he takes is “only change 1 line at a time”. Surely, depending on how trivial the case, you might tweak this approach.
Then there’s TDD. Some studies have shown that TDD saves your project from 40% to 90% of bugs. This approach is even more protested by developers even though it’s proven to benefit in the long run. Which is what you want to think about if you don’t plan to go out of business anytime soon. High testing standards also give a competitive edge against other companies.
Implement Integration Tests
Of course, some flows of the code might be harder to cover with unit tests or even seem like you’re doing an integration/unit test hybrid . It’s a tricky question. In such a case I like to remember the advice of Sandy Metz:
I would probably consider covering the hack with a unit test that acts as an integration test. Make it as minimal as possible. Cover it with extensive documentation and create a ticket to remove the tests as soon as proper integration tests are introduced.
Also, I'm not an expert on integration tests, but in this case they would’ve prevented the loss of the data logs.
Last point that I wanted to mention is about the lessons from Jocko Willinks book “Extreme Ownership”. I strongly believe that the failure of understanding the importance of following the checklists is partially on the supervisors of the teams — either technical leads or project managers — and the developers themselves.
The fact that developers and QAs don’t strictly follow the checklists means that there isn’t a crystal clear understanding of their importance and what problems might occur in case of missing even a single step. Of course, this might also be because the supervisors don’t know about some parts of the processes.
7 days after the PR was pushed to the production environment, a tally mechanism for comparing server and client logs went into development. It would notify of any discrepancies between server and client logs.
Bobs team started a process of introducing weekly monitoring of logs. However, there hasn’t been any follow up on this from the supervisors.
Fortunately, talks have been started to address future issues like these.
There’s a ton of actions that can be taken to improve processes in a company. Activities that would prevent from failures both outside or inside the code. However, one things for sure, unfortunately — we, humans, need a disaster to convince us of the importance of preparation.