How DevOps Repair Logs Increase Your Happiness

Written by fmira21 | Published 2023/07/05

TL;DR: After each large outage, we created a document describing the nature and first occurrence of an outage, our course of action, and outcomes. Post-mortems were finalized with a follow-up section indicating plans and Jira tickets to prevent similar events in the future. I really like this approach and advocate this practice to avoid recurring issues.

How do you document infrastructure failures and outages?

Post-mortem discipline: a way to document serious problems

At the very beginning of my career, I familiarised myself with the post-mortem practice. After each large outage (fortunately, they were really rare), we created a document describing the nature and first occurrence of an outage, our course of action, and outcomes. Post-mortems were finalized with a follow-up section indicating plans and Jira tickets to prevent similar events in the future.

You can familiarise yourself with a typical structure of a DevOps post-mortem in this article, as well as with a great example here.

Our structure was similar:

  • First occurrence and event log
  • The course of action and responsible persons
  • Root cause analysis
  • Follow-ups to mitigate similar issues

I really like this approach and advocate this practice to avoid recurring issues. The nature of DevOps work speaks for itself: the task density is so high that it’s hard to remember every workflow. Post-mortems keep vital information available, share the experience gained with the rest of the team, and help onboard newcomers.

More issues, less time

However, when I acquired my own areas of responsibility and started to work independently, I soon realized that my scope of work included many more issues to fix than I had time to document in post-mortems.

My tasks involved dealing with dozens of multilayer services, each of them quite volatile and sensitive to multiple infrastructure factors: network speed, free disk space, inter-service connectivity, etc. Moreover, they were based on open-source software and required constant updates to remain functional, and often an update only brought new issues. All in all, it looks like an administrator's nightmare, doesn’t it?

In fact, it was not, but I still soon realized that I needed to document how I fixed small issues of various kinds. So I returned to the Post-Mortem workflow, wondering if it could help, and adopted it.

Large repair logs to solve small problems

First of all, I decided to use one unified document (namely, a Confluence page) to store descriptions of all fixed issues. Since my work sometimes reminded me of the Adeptus Mechanicus from Warhammer 40k, I named it “Prayer Book: How we fixed minor issues” and officially presented it as a Repair Log.

So, what was inside?

I stuck to our Post-Mortem plan but excluded some redundant steps: namely, the first and last bullets from the plan above. The resulting structure was easy to read and could be filled in within 5-10 minutes for each issue:

  • Short description of an issue and which service it affected
  • The root cause of the issue and, optionally, steps to reproduce
  • The course of actions to mitigate the issue

On top of this structure, I set two rules for filling in the log:

  • When? - Always, right after you fix an issue.
  • How? - Keep it short and precise; don’t highlight general issues; describe the actions you took, even if you are not sure they fixed the issue.

The latter rule may surprise you, but the reason is simple: sometimes you try to fix an issue in multiple ways, and without time to check the outcome, you cannot be sure what exactly solved the problem. I decided to leave such entries for future amendments.
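To make the entry structure more concrete, here is a minimal Python sketch of how one Repair Log record could be modelled. This is only an illustration, not the actual tooling: the original log was a Confluence page filled in by hand. The names `Action`, `RepairLogEntry`, `confirmed`, and `needs_amendment`, as well as the sample issue and service, are hypothetical. The `confirmed` flag is one possible way to mark actions whose effect was never verified, so that such entries can be found and amended later.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """One step taken while fixing an issue."""
    description: str
    confirmed: bool = False  # assumption: True only once we know this step helped


@dataclass
class RepairLogEntry:
    """A single record following the three-part structure of the log."""
    issue: str                     # short description of the issue
    service: str                   # which service it affected
    root_cause: str                # root cause, as far as it is known
    actions: List[Action] = field(default_factory=list)
    steps_to_reproduce: str = ""   # optional

    def needs_amendment(self) -> bool:
        """True while no action has been confirmed as the actual fix."""
        return not any(a.confirmed for a in self.actions)

    def render(self) -> str:
        """Format the entry as plain text, e.g. for pasting into a wiki page."""
        lines = [
            f"Issue: {self.issue} (service: {self.service})",
            f"Root cause: {self.root_cause}",
        ]
        if self.steps_to_reproduce:
            lines.append(f"Steps to reproduce: {self.steps_to_reproduce}")
        lines.append("Actions:")
        for a in self.actions:
            mark = "confirmed" if a.confirmed else "unverified"
            lines.append(f"  - {a.description} [{mark}]")
        return "\n".join(lines)


# Hypothetical sample entry
entry = RepairLogEntry(
    issue="Uploads fail with a write error",
    service="file-storage",
    root_cause="Disk full on the data volume",
    actions=[
        Action("Removed old temporary files", confirmed=True),
        Action("Restarted the service"),
    ],
)
print(entry.render())
```

Keeping unverified actions in the record, instead of dropping them, follows the second rule: the entry still documents everything that was tried, and `needs_amendment()` gives a simple way to list records that deserve a second look.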

So I created a small and fast-growing version of Post-Mortems for internal use.

What was the outcome?

The Repair Log was appreciated by the rest of my team and started to grow. Together with Post-Mortems, this document became a vital part of our onboarding process. It is easy to read and expand, but most importantly, it remains my “second memory” when I face any issue that looks familiar.

Within a few months, I even noticed that the number of issues had begun to decrease, and nearly all of them were already covered by the log. Then, I said to myself: “Probably, it’s the best solution to increase the quality of your operations with these services”. I still work with these services, but maintenance no longer takes a significant amount of time.


Written by fmira21 | DevOps Engineer, ex-Product, ex-TechWriter, ex-Interpreter. Writing about (self)education, DocOps and DevOps practices.
Published by HackerNoon on 2023/07/05