Relax, nobody lost a forest. What they did lose was a multitude of essential data logs. The subject company here will be referred to as TheCompany. The developer that found out about the issue is Bob. In this article I want to discuss human errors in software development and preventive measures against them.

From Development To Production

TheCompany's process for deploying codebase changes is summed up like this:

1. Developer tests his own changes (attributed as a common-sense item on the list, thus not enforced)
2. Code review (only 1 reviewer required, who often has limited time to spend on your ticket)
3. Pushing to pre-production
4. Testing in pre-production
5. Go over a before-pushing-to-production checklist
6. Go over a during-pushing-to-production checklist
7. Go over an after-pushing-to-production checklist

The checklist is comparable to a before-takeoff checklist for pilots. As mentioned in the linked article: "The […] practical test standards clearly state that pilots must use appropriate written checklists."

You wouldn't argue about the need to do a check of all systems before taking off in an airplane, would you? Would you board a plane if an airport staff member said that the pilots had skipped some checks to be able to depart faster? I know I wouldn't. Yet the accident record shows that some pilots don't [follow the checklist]. Such actions can yield dire results.

TheCompany is not an airline, and the cost of its mistakes most likely isn't human lives. However, mistakes do impact the company's bottom line, which in the worst-case scenario causes people to be let go.

Altogether, in TheCompany there are 3 checklists: 1) Pre-production, 2) During-production and 3) Post-production.

Pre-production checklist:

- Schedule the push to production in the team's calendar
- Pull Request has been reviewed and approved
- Affected teams test and approve the changes
- Assigned tester ensures all test cases pass and meet the requirements
- Ticket has no conflicts with the testing or production environment
- Run smoke/regression tests on desktop and mobile

During-production checklist:

- No current incident in the release pipeline (failing build)
- No other ticket is being prodded at the same time
- Announce in Slack that prodding has begun
- Press the red button to push to production

Post-production checklist:

- Check deploy script logs to see if prodding was successful
- Run smoke/regression tests on desktop and mobile
- Monitor all logs and graphs for 1 hour
- Notify affected teams
- If there are no issues, announce in Slack that the deployment was successful

There's clearly a lot that could go wrong, and to put the blame on one specific reason is to not see the whole picture. So let's unpack some of the aforementioned points: why it went wrong, how it could've been prevented, and what TheCompany did in the aftermath. As a side note, several of these checklist items don't have to live purely in human memory, as sketched below.
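Here is a minimal sketch of that idea, using the during-production items as the example. Everything in it is my own assumption, since TheCompany's actual tooling is never described in this article: the ReleaseState shape, the function names and the Slack incoming-webhook setup are all invented for illustration.

```typescript
// Purely illustrative: none of these names come from TheCompany's actual
// tooling. The point is that checklist preconditions get enforced, not remembered.

interface ReleaseState {
  buildPassing: boolean;            // is the release pipeline green?
  otherProddingInProgress: boolean; // is another ticket being prodded right now?
}

// Placeholder: a Slack incoming-webhook URL would normally come from config.
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? "";

async function announceInSlack(text: string): Promise<void> {
  // Slack incoming webhooks accept a plain JSON body with a "text" field.
  await fetch(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
}

async function pushToProduction(
  state: ReleaseState,
  deploy: () => Promise<void>, // "the red button"
): Promise<void> {
  // The during-production checklist, as code instead of memory.
  if (!state.buildPassing) {
    throw new Error("Release pipeline has a failing build, aborting");
  }
  if (state.otherProddingInProgress) {
    throw new Error("Another ticket is being prodded right now, aborting");
  }

  await announceInSlack("Prodding has begun");
  await deploy();
  await announceInSlack("Deployment was successful");
}
```

The specific checks matter less than the principle: a gate like this cannot forget a step the way a tired human can.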
The Recipe For Disaster

Here's why the problem occurred and led to the loss of a multitude of logs:

1. The Pull Request contained a minor change to a line of code that might be considered a hack, and the change also had a vague description
2. In the same PR, 1/3 of the program flows weren't tested by Bob because he assumed they behaved the same as the other flows
3. The code reviewer didn't notice the hacky line change
4. A test case to check for the soon-to-be-broken logs was not added by Bob (note: there is a shortage of Quality Assurance people in the company)
5. The tester, who was also the code reviewer, did not notice that log requests were not being sent (note: if there is no QA, the code reviewer becomes the tester)
6. After pushing to production and during monitoring, the missing logs weren't noticed because the drop in the graph was too small in comparison to other logs
7. No daily monitoring of logs and graphs exists, hence no one noticed the issue
8. The next PRs that went into production either didn't follow the checklists and/or didn't notice the drop in the graphs, i.e. the same as step 6

How the loss could've been prevented

7-day-timeframe log check should be enforced

Developers usually check only the last 1 to 4 hours of logs during a release. This means that if any previous production release older than 4 hours had an anomaly in the logs, it would again go unnoticed. This loop can theoretically go on forever. In fact, Bob has already caught multiple bugs exactly by doing a 7-day-timeframe log check. So far, nobody (neither the management nor the developers) has considered a 7-day timeframe a necessity.

Time is also not an issue here, because a developer can quickly switch the logs to show 7-day, 1-day or 1-hour periods.

Daily monitoring

Currently there is no proper daily monitoring of the logs. Checks are done within the scope of the checklists, but that's about it. A solution could be to implement a mandatory check of the logs at the end of the day.

When I was working at Evolution Gaming in Latvia, my team had rotating roles where a person would check Sentry error logs, notify the team about pending Pull Request reviews, conduct the daily standup and more. There were daily roles and roles assigned to a person for the whole sprint (2 weeks). To ensure all of this was followed properly and without boredom, we also added a gamification mechanism to these processes, e.g. points for each pull request review.

We can all talk about how reviews should be done, but the reality is that people often forget, are overwhelmed with other tasks, simply don't want to do a code review, or have a myriad of other reasons. Incentives are a good way to manage some of these friction points. The same approach could be adopted by TheCompany regarding monitoring. Unless you're a startup that just raised $100M in Series A, you definitely have time to invest in these worthwhile activities that benefit everybody in the long run.
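Both suggestions above (the 7-day-timeframe check and daily monitoring) boil down to the same mechanical step: compare today's volume for each log stream against a longer baseline and raise a flag when it drops. Here is a minimal sketch of that step. The article doesn't name TheCompany's logging backend, so the fetchCounts function is a stand-in for whatever actually stores the log volumes, and the 30% threshold is an arbitrary example value.

```typescript
// Hypothetical sketch: the data source is not specified in the article,
// so fetchCounts is a placeholder for the real logging backend.

type FetchCounts = (event: string, days: number) => Promise<number[]>;

async function checkForLogDrop(
  event: string,
  fetchCounts: FetchCounts,
  dropThreshold = 0.3, // flag a 30%+ drop versus the baseline; arbitrary example value
): Promise<boolean> {
  // One count per day, oldest first, covering the last 7 days.
  const counts = await fetchCounts(event, 7);
  const today = counts[counts.length - 1];
  const baseline =
    counts.slice(0, -1).reduce((sum, n) => sum + n, 0) / (counts.length - 1);

  const dropped = today < baseline * (1 - dropThreshold);
  if (dropped) {
    console.warn(
      `Possible missing logs for "${event}": today=${today}, 7-day baseline=${baseline.toFixed(0)}`,
    );
  }
  return dropped;
}
```

Run daily over every known event name, a check like this could have flagged the missing logs even though the drop looked negligible next to the other, much larger graphs.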
2 Approvers or More

Every team that I have been a part of that had an effective code review process had a minimum of 2 approvals per PR. This has proven so beneficial that I even try to enforce this approach during my freelancing contracts.

The main reason why one approver is often not enough is the same as why you need at least 2 pilots to fly a plane. If one forgets to flip a switch on the panel, the other pilot is there to fix the mistake.

Better tests

Regardless of whether it's unit or integration tests, they are your safety net. The hacky part of the code was not covered by integration tests. In fact, TheCompany has no integration tests for Bob's team, and the fight to introduce them is surprisingly tough. Not because of technical difficulties, but because of a lack of coherence, motivation and (maybe) ignorance.

Subjectively, the #1 reason most developers don't write tests is "why bother if it already works and I can hack something together so much faster without the friction of having to think about tests". The #2 reason is short deadlines.

If we talk about unit tests, it's sometimes hard for people to see their benefits because it's difficult to quantify how many bugs they actually prevent. The thing is, unit tests prevent bugs during development. Most likely your project will have tests running before a push to production, so even if 1 test fails, the merge is aborted. This gives you a chance to fix the issue and push again.

Even better, have tests run in watch mode. I've seen developers program without tests running in the background, only to find themselves spending way too much time trying to backtrack through their changes. Even in watch mode, you can get caught up in the work and go too far. This can be prevented by adopting a refactoring approach. In Martin Fowler's book about Refactoring, he suggests refactoring meticulously while staying in the green, i.e. if a test fails, revert your last change and implement it in a way that doesn't break the tests. The approach he takes is "only change 1 line at a time". Surely, depending on how trivial the case is, you might tweak this approach.

Then there's TDD. Some studies have shown that TDD saves your project from 40% to 90% of bugs. This approach is protested by developers even more, even though it's proven to pay off in the long run, which is what you want to think about if you don't plan on going out of business anytime soon. High testing standards also give you a competitive edge over other companies.

Implement Integration Tests

Of course, some flows of the code might be harder to cover with unit tests, or it might even seem like you're writing an integration/unit test hybrid. It's a tricky question. In such a case I like to remember the advice of Sandi Metz:

[Image: from Sandi Metz's talk at RailsConf 2013 - https://www.youtube.com/watch?v=URSWYvyc42M]

I would probably consider covering the hack with a unit test that acts as an integration test. Make it as minimal as possible. Cover it with extensive documentation and create a ticket to remove the test as soon as proper integration tests are introduced. Also, I'm not an expert on integration tests, but in this case they would've prevented the loss of the data logs.
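To make that concrete, here is roughly what such a minimal test could look like. It assumes a Jest-style test runner, and every module, function and flow name in it is invented for illustration, since the article never shows TheCompany's actual code; the one idea that matters is the assertion that every program flow still emits its log request.

```typescript
// Everything here is hypothetical. Assumed setup: runFlow() executes one of
// the program flows and is expected to send a log request through the
// transport it is given.
import { runFlow, FLOWS } from "./flows";

describe("logging", () => {
  it.each(FLOWS)("the %s flow still sends its log request", async (flow) => {
    // Stub only the network edge and run the real flow code on top of it.
    const sendLog = jest.fn().mockResolvedValue(undefined);

    await runFlow(flow, { sendLog });

    // The assertion that was missing when the hack slipped through:
    // every flow, including the "surely it's the same as the others" third,
    // must still emit its log request.
    expect(sendLog).toHaveBeenCalled();
  });
});
```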
Extreme Ownership

The last point that I wanted to mention is about the lessons from Jocko Willink's book "Extreme Ownership". I strongly believe that the failure to understand the importance of following the checklists is partially on the supervisors of the teams (either technical leads or project managers) and partially on the developers themselves.

The fact that developers and QAs don't strictly follow the checklists means that there isn't a crystal clear understanding of their importance and of what problems might occur if even a single step is missed. Of course, this might also be because the supervisors don't know about some parts of the processes. In that case, it's the job of developers and QAs to "report up the chain of command".

What Happened In The Aftermath

7 days after the PR was pushed to the production environment, a tally mechanism for comparing server and client logs went into development. It would notify about any discrepancies between server and client logs.

Bob's team started a process of introducing weekly monitoring of logs. However, there hasn't been any follow-up on this from the supervisors. Fortunately, talks have been started to address future issues like these.

There's a ton of actions that can be taken to improve processes in a company, activities that would prevent failures both outside and inside the code. However, one thing's for sure, unfortunately: we humans need a disaster to convince us of the importance of preparation.