We Crashed and Lost All Essential Data Logs. Where Did We Go Wrong?

Written by rolandisimo | Published 2019/10/26
Tech Story Tags: programming | software-development | software-engineering | software-development-teams | software-failure | ownership | approvers | latest-tech-stories

TLDR The company lost a multitude of essential data logs. Its process for deploying codebase changes is summed up like this: the developer tests his own changes, the code is reviewed, and checklists are followed before, during, and after pushing to production. After pushing to production and during monitoring, the missing logs weren’t noticed because the drop in the graph was too small in comparison to other logs. No daily monitoring of logs and graphs exists, hence no one noticed the issue. Bob has already caught multiple bugs from doing exactly such a 7-day-timeframe log check.

Relax, nobody lost a forest.
What they did lose was a multitude of essential data logs.
The subject company here will be referred to as TheCompany. The developer who found out about the issue is Bob.
In this article I want to discuss human errors in software development and preventive measures against them.

From Development To Production

TheCompany’s process for deploying codebase changes is summed up like this:
  1. Developer tests his own changes (Considered a common-sense item on the list, thus not enforced)
  2. Code review (Only 1 reviewer required who often has limited time to spend on your ticket)
  3. Pushing to pre-production
  4. Testing in pre-production
  5. Go over a checklist before pushing to production
  6. Go over a checklist during pushing to production
  7. Go over a checklist after pushing to production
The checklist is compared to a before-takeoff checklist for pilots. As mentioned in the linked article:
The […] practical test standards clearly state that pilots must use appropriate written checklists
You wouldn’t argue about the need to do a check of all systems before taking off in an airplane, would you? Would you board a plane if an airport staff member said that the pilots had skipped some checks in order to depart faster?
I know I wouldn’t.
Yet the accident record shows that some pilots don’t [follow the checklist]. Such actions can yield dire results.
TheCompany is not an airline and the cost of their mistakes most likely isn’t human lives. However, they do impact the company’s bottom line, which in the worst case scenario causes people to be let go.
Altogether, in TheCompany there are 3 checklists — 1) Pre-production, 2) During-production and 3) Post-production

Pre-production checklist:

  1. Schedule the push to production in the team’s calendar
  2. Pull Request has been reviewed and approved
  3. Affected teams test and approve the changes
  4. Assigned tester ensures all test cases pass and meet the requirements
  5. Ticket has no conflicts with testing or production environment
  6. Run smoke/regression tests on desktop and mobile

During-production checklist:

  1. No current incident (failing build) in the release pipeline
  2. No current ticket is being prodded at the same time
  3. Announce in Slack that prodding has begun
  4. Press the red button to push to production

Post-production checklist:

  1. Check deploy script logs to see if prodding was successful
  2. Run smoke/regression tests on desktop and mobile
  3. Monitor all logs and graphs for 1h
  4. Notify affected teams
  5. If there are no issues, announce in Slack that deployment was successful
There are clearly a lot of things that could go wrong, and putting the blame on a single reason means not seeing the whole picture. So let’s unpack some of the aforementioned points:
  1. Why it went wrong
  2. How it could’ve been prevented
  3. What TheCompany did in the aftermath.

The Recipe For Disaster

Here’s why the problem occurred and led to the loss of a multitude of logs.
  1. The Pull Request contained a minor change to a line of code that might be considered a hack, and the change had a vague description
  2. In the same PR, 1/3 of the program flows weren't tested by Bob because he assumed they behaved the same as the other flows
  3. The code reviewer didn’t notice the hack line change
  4. A test case to check for the soon-to-be-broken logs was not added by Bob (Note: There is a shortage of Quality Assurance people in the company)
  5. The tester, who was also the code reviewer, did not notice that log requests were not being sent (Note: If there is no QA, the code reviewer becomes the tester)
  6. After pushing to production and during monitoring, the missing logs weren’t noticed because the drop in the graph was too small in comparison to other logs
  7. No daily monitoring of logs and graphs exists. Hence, no one noticed the issue
  8. The next PRs that went into production either didn’t follow the checklists or ran into the same problem as step 6: the drop in the graphs went unnoticed

How the loss could’ve been prevented

7-day-timeframe log check should be enforced
Developers usually only check the last 1 to 4 hours of logs during the release. This means that if any previous production release older than 4 hours had an anomaly in the logs, it would again go unnoticed. Theoretically, this loop can go on forever.
In fact, Bob has already caught multiple bugs from doing exactly such a 7-day-timeframe log check. So far, nobody — neither the management nor the developers — has considered a 7-day timeframe a necessity.
Time is also not an issue here, because a developer can quickly switch the logs to show 7-day, 1-day, or 1-hour periods.
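To make this concrete, here’s a rough sketch of what an automated 7-day check could look like: compare the last 7 days of per-event log volume against the 7 days before that and flag suspicious drops. The fetchLogCounts parameter and the 30% threshold are my own placeholders, not TheCompany’s actual tooling.

```typescript
// Hypothetical sketch: flag per-event log-volume drops by comparing the last
// 7 days against the 7 days before that. `FetchLogCounts` stands in for
// whatever query the real logging backend exposes.
type LogCounts = Record<string, number>; // event name -> number of log entries
type FetchLogCounts = (fromMs: number, toMs: number) => Promise<LogCounts>;

const DAY_MS = 24 * 60 * 60 * 1000;
const DROP_THRESHOLD = 0.3; // flag events whose volume fell by more than 30%

export async function checkSevenDayLogDrop(
  fetchLogCounts: FetchLogCounts,
  now = Date.now()
): Promise<string[]> {
  const current = await fetchLogCounts(now - 7 * DAY_MS, now);
  const previous = await fetchLogCounts(now - 14 * DAY_MS, now - 7 * DAY_MS);

  const warnings: string[] = [];
  for (const [event, before] of Object.entries(previous)) {
    const after = current[event] ?? 0;
    if (before > 0 && (before - after) / before > DROP_THRESHOLD) {
      warnings.push(`"${event}" dropped from ${before} to ${after} logs week-over-week`);
    }
  }
  return warnings;
}
```

Running something like this as a scheduled job, or as part of the post-production checklist, could have surfaced the missing logs well before anyone had to spot a tiny dip in a graph by eye.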
Daily monitoring
Currently, there is no proper daily monitoring of the logs. Checks happen within the scope of the release checklists, but that’s about it. A solution could be to implement a mandatory check of the logs at the end of each day.
When I was working at Evolution Gaming in Latvia, my team had rotating roles where a person would check Sentry error logs, notify the team about pending Pull Request reviews, conduct the daily standup, and more. There were daily roles and roles assigned to a person for the whole sprint (2 weeks).
To ensure all of this was followed properly and without boredom, we also added a gamification mechanism to these processes, e.g. points for each pull request review.
We can all talk about how reviews should be done but the reality is, people often either forget, are overwhelmed with other tasks, simply don’t want to do a code review or a myriad of other reasons. Incentives are a good way to manage some of these friction points.
The same approach could be adopted by TheCompany for monitoring. Unless you’re a startup that just raised $100M in Series A, you definitely have time to invest in these worthwhile activities that benefit everybody in the long run.
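To show how little ceremony such a rotation needs, here’s an illustrative sketch; the roster names and the starting date are made up.

```typescript
// Illustrative sketch: pick who does today's log/Sentry check from a rotating
// roster. The names and the starting date are made up.
const roster = ["Alice", "Bob", "Carol", "Dave"];
const rotationStart = new Date("2019-10-01").getTime();
const DAY_MS = 24 * 60 * 60 * 1000;

function dailyLogChecker(date = new Date()): string {
  const daysSinceStart = Math.floor((date.getTime() - rotationStart) / DAY_MS);
  return roster[daysSinceStart % roster.length];
}

// e.g. post to the team's Slack channel every morning:
// console.log(`Today's log check is on: ${dailyLogChecker()}`);
```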
2 Approvers or More
Every team that I have been a part of that had an effective code review process had a minimum of 2 approvals per PR. This has proven so beneficial that I even try to enforce this approach during my freelancing contracts.
The main reason one approver is often not enough is the same reason you need at least 2 pilots to fly a plane: if one forgets to flip a switch on the panel, the other pilot is there to catch the mistake.
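If the code happens to live on GitHub, the 2-approver rule doesn’t even have to rely on discipline; it can be enforced in the branch protection settings. Here’s a sketch that sets it through GitHub’s branch-protection endpoint; the owner, repo and branch names are placeholders.

```typescript
// Sketch only: enforce a minimum of 2 approving reviews via GitHub branch
// protection, assuming the code is hosted on GitHub. Owner, repo and branch
// names are placeholders.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function requireTwoApprovals(owner: string, repo: string, branch: string) {
  await octokit.request("PUT /repos/{owner}/{repo}/branches/{branch}/protection", {
    owner,
    repo,
    branch,
    required_pull_request_reviews: {
      required_approving_review_count: 2, // the "two pilots" rule
    },
    // The remaining protection settings are left untouched in this sketch.
    required_status_checks: null,
    enforce_admins: null,
    restrictions: null,
  });
}

// requireTwoApprovals("the-company", "the-repo", "master");
```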
Better tests
Regardless of whether they are unit or integration tests, tests are your safety net.
The hacky part of the code was not covered by integration tests. In fact, TheCompany has no integration tests for Bob’s team, and the fight to introduce them is surprisingly tough. Not because of technical difficulties, but because of a lack of coherence, motivation and (maybe) ignorance.
Subjectively, the #1 reason most developers don’t write tests is “why bother if it already works and I can hack something together so much faster without the friction of having to think about tests”.
The #2 reason is short deadlines.
If we talk about unit tests, it’s sometimes hard for people to see the benefits of them because it’s difficult to quantify how many bugs they actually prevent.
The thing is, unit tests prevent bugs during development. Most likely, your project will have tests running before pushing to production, so if even 1 test fails, the merge is aborted. This gives you a chance to fix the issue and push again.
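As a minimal illustration, here’s a Jest-style test for a made-up sendLog helper (not TheCompany’s actual code). It’s cheap to write, and it fails loudly the moment someone breaks the contract.

```typescript
// Minimal Jest-style sketch: a made-up sendLog helper and the unit test that
// guards its contract. None of this is TheCompany's actual code.
type Transport = (payload: { event: string; data: unknown }) => Promise<void>;

export function makeSendLog(transport: Transport) {
  return async function sendLog(event: string, data: unknown) {
    if (!event) throw new Error("Log event name is required");
    await transport({ event, data });
  };
}

test("sendLog forwards the event to the transport", async () => {
  const sent: Array<{ event: string; data: unknown }> = [];
  const sendLog = makeSendLog(async (payload) => {
    sent.push(payload);
  });

  await sendLog("checkout_started", { cartId: 42 });

  expect(sent).toEqual([{ event: "checkout_started", data: { cartId: 42 } }]);
});
```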
Even better, have tests run in watch mode. I’ve seen developers program without tests running in the background only to find themselves spending way too much time trying to backtrack in their changes.
Even in watch mode, you can get caught up in the work and go too far. This can be prevented by adopting a refactoring discipline. In Martin Fowler’s book on refactoring, he suggests refactoring meticulously while staying in the green, i.e. if a test fails, revert your last change and implement it in a way that doesn’t break the tests. The approach he takes is to “only change 1 line at a time”. Surely, depending on how trivial the case is, you might tweak this approach.
Then there’s TDD. Some studies have shown that TDD saves your project from 40% to 90% of bugs. This approach meets even more resistance from developers, even though it has proven to pay off in the long run, which is what you want to think about if you don’t plan to go out of business anytime soon. High testing standards also give you a competitive edge over other companies.
Implement Integration Tests
Of course, some flows of the code might be harder to cover with unit tests, or it might even seem like you’re writing an integration/unit test hybrid. It’s a tricky question. In such cases I like to remember the advice of Sandi Metz.
I would probably consider covering the hack with a unit test that acts as an integration test. Make it as minimal as possible. Cover it with extensive documentation and create a ticket to remove the tests as soon as proper integration tests are introduced.
Also, I'm not an expert on integration tests, but in this case they would’ve prevented the loss of the data logs.
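Purely for illustration, here’s what such a “unit test acting as an integration test” could look like for a flow that is supposed to send a log request. Everything in it, the flow, the fake HTTP client and the /logs endpoint, is a hypothetical stand-in.

```typescript
// Hypothetical sketch: a unit test acting as an integration test for a flow
// that is supposed to send a log request. The flow, the fake HTTP client and
// the "/logs" endpoint are illustrative stand-ins.
type Http = { post: (url: string, body: unknown) => Promise<{ status: number }> };

// Stand-in for the real flow that contained the "hack" line.
async function runCheckoutFlow(deps: { http: Http }) {
  // ...business logic would go here...
  await deps.http.post("/logs", { event: "checkout_completed" });
}

test("checkout flow still sends its log request", async () => {
  const requests: Array<{ url: string; body: unknown }> = [];
  const fakeHttp: Http = {
    post: async (url, body) => {
      requests.push({ url, body });
      return { status: 200 };
    },
  };

  await runCheckoutFlow({ http: fakeHttp });

  // This is the assertion that would have caught the missing logs.
  expect(requests.some((r) => r.url === "/logs")).toBe(true);
});
```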
Extreme Ownership
The last point I want to mention is about the lessons from Jocko Willink’s book “Extreme Ownership”. I strongly believe that the failure to understand the importance of following the checklists is partially on the supervisors of the teams — either technical leads or project managers — and partially on the developers themselves.
The fact that developers and QAs don’t strictly follow the checklists means that there isn’t a crystal clear understanding of their importance and what problems might occur in case of missing even a single step. Of course, this might also be because the supervisors don’t know about some parts of the processes. In that case, it’s the job of developers and QAs to “report up the chain of command”.

What Happened In The Aftermath

7 days after the PR was pushed to the production environment, a tally mechanism for comparing server and client logs went into development. It would notify of any discrepancies between the two.
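I don’t know the internals of that mechanism, but the core idea can be sketched in a few lines: compare per-event counts from both sides over the same time window and report anything that drifts beyond a tolerance. The data sources and the threshold below are assumptions.

```typescript
// Rough sketch of the tally idea: compare per-event counts from server and
// client logs over the same time window and report anything that drifts
// beyond a tolerance. The data sources and threshold are assumptions.
type Tally = Record<string, number>; // event name -> count for the window

function findLogDiscrepancies(
  serverCounts: Tally,
  clientCounts: Tally,
  tolerance = 0.05 // allow 5% drift before alerting
): string[] {
  const events = new Set([...Object.keys(serverCounts), ...Object.keys(clientCounts)]);
  const discrepancies: string[] = [];

  for (const event of events) {
    const server = serverCounts[event] ?? 0;
    const client = clientCounts[event] ?? 0;
    const max = Math.max(server, client);
    if (max > 0 && Math.abs(server - client) / max > tolerance) {
      discrepancies.push(`"${event}": server=${server}, client=${client}`);
    }
  }
  return discrepancies;
}

// findLogDiscrepancies({ purchase: 1000 }, { purchase: 0 })
// -> ['"purchase": server=1000, client=0']
```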
Bob’s team started a process of introducing weekly monitoring of logs. However, there hasn’t been any follow-up on this from the supervisors.
Fortunately, talks have been started to address future issues like these.
There are a ton of actions that can be taken to improve processes in a company, activities that would prevent failures both outside and inside the code. However, one thing’s for sure, unfortunately: we humans need a disaster to convince us of the importance of preparation.

Written by rolandisimo | Freelance Fullstack developer. Enjoy clean code, refactoring and a good cup of tea.
Published by HackerNoon on 2019/10/26