How DevOps Repair Logs Increase Your Happiness

How do you document infrastructure failures and outages?

Post-mortem discipline: a way to document serious problems

At the very beginning of my career, I familiarised myself with the post-mortem practice. After each large outage (fortunately, they were really rare), we created a document describing the nature and first occurrence of an outage, our course of action, and outcomes. Post-mortems were finalized with a follow-up section indicating plans and Jira tickets to prevent similar events in the future.

You can familiarise yourself with a typical structure of a DevOps post-mortem in this article, as well as with a great example here.

Our structure was similar:

First occurrence and event log
The course of action and responsible persons
Root cause analysis
Follow-ups to mitigate similar issues

I really like this approach and advocate the practice as a way to avoid recurring issues. The nature of DevOps work speaks for itself: the task density is so high that it’s hard to remember every workflow. Post-mortems keep vital information available, share the experience gained with the rest of the team, and help onboard newcomers.

More issues, less time

However, when I took on my own areas of responsibility and started to work independently, I realized very soon that I had many more issues to fix than time to document them in post-mortems.

My work involved dozens of multilayer services, each quite volatile and sensitive to multiple hardware factors: network speed, free disk space, inter-service connectivity, etc. Moreover, they were based on open-source software and required constant updates to remain functional, and often an update only brought new issues. All in all, it looks like an administrator’s nightmare, doesn’t it?

In fact, it was not. Still, I soon realized that I needed to document how I fixed small issues of various kinds. So I returned to the Post-Mortem workflow, wondering whether it could help, and adopted it.

Large repair logs to solve small problems

First of all, I decided to use one unified document (namely, a Confluence page) to store descriptions of all fixed issues. Since my work sometimes reminded me of the Adeptus Mechanicus from Warhammer 40k, I named it “Prayer Book: How we fixed minor issues” and presented it officially as a Repair Log.

So, what was inside?

I stuck to our Post-Mortem plan but excluded the redundant steps, namely the first and last bullets from the plan above. The resulting structure was easy to read and could be filled in within 5-10 minutes for each issue:

Short description of an issue and which service it affected
The root cause of the issue and, optionally, steps to reproduce
The course of actions to mitigate the issue
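
To make the three-part structure concrete, here is a minimal sketch of one entry kept as structured data and rendered as plain text. The field names, the sample issue, and the service name are hypothetical and chosen only for illustration; the log itself was a plain Confluence page, not code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepairLogEntry:
    """One Repair Log entry (field names are illustrative assumptions)."""
    description: str            # short description of the issue
    service: str                # which service it affected
    root_cause: str             # the root cause of the issue
    actions: List[str]          # course of actions taken to mitigate it
    steps_to_reproduce: Optional[List[str]] = None  # optional part of the entry

    def render(self) -> str:
        """Format the entry as plain text, ready to paste into a wiki page."""
        lines = [
            f"Issue: {self.description}",
            f"Service: {self.service}",
            f"Root cause: {self.root_cause}",
        ]
        if self.steps_to_reproduce:
            lines.append("Steps to reproduce:")
            lines += [f"  {i}. {s}" for i, s in enumerate(self.steps_to_reproduce, 1)]
        lines.append("Actions taken:")
        lines += [f"  {i}. {a}" for i, a in enumerate(self.actions, 1)]
        return "\n".join(lines)


# Hypothetical example entry.
entry = RepairLogEntry(
    description="Nightly backup job failed",
    service="backup-worker",
    root_cause="Disk was full",
    actions=["Removed old archives", "Restarted the job"],
)
print(entry.render())
```

Keeping the reproduction steps optional mirrors the structure above: an entry is still complete without them, which helps keep the 5-10 minute budget realistic.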

On top of this structure, I set two rules for filling in the log:

When? - Always, right after you fix an issue.
How? - Keep it short and precise. Don’t dwell on general issues. Describe the actions you took even if you are not sure they fixed the issue.

The latter rule may surprise you, but the reasoning is simple: sometimes you try to fix an issue in several ways and have no time to check which of them actually solved the problem. I decided to leave such entries as they were, to be amended later.
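
One way to keep such uncertain entries easy to find later is to tag each action with whether its effect was confirmed. The sketch below is only an assumption about how that could be tracked; the `Action` type, the `verified` flag, and the sample actions are hypothetical and were not part of the original log.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    text: str
    verified: bool = False  # set to True once the step is confirmed to have helped


def needs_amendment(actions: List[Action]) -> bool:
    """True while no recorded action has been confirmed as the actual fix."""
    return not any(action.verified for action in actions)


def render_actions(actions: List[Action]) -> str:
    """List every action taken, tagging the ones whose effect is still unknown."""
    return "\n".join(
        f"- {action.text}" + ("" if action.verified else " [unverified]")
        for action in actions
    )


# Hypothetical entry: two things were tried, neither outcome was checked.
actions = [Action("Restarted the service"), Action("Cleared the cache")]
print(render_actions(actions))
print("Needs amendment:", needs_amendment(actions))
```

Every action stays in the entry either way; the tag only marks which ones still deserve a second look when the issue comes back.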

So I created a small and fast-growing version of Post-Mortems for internal use.

What was the outcome?

The Repair Log was appreciated by the rest of my team and started to grow. Together with Post-Mortems, this document became a vital part of our onboarding process. It is easy to read and expand, but most importantly, it remains my “second memory” when I face any issue that looks familiar.

After a few months, I even noticed that the number of issues began to decrease, and nearly all of them became part of the log. Then, I said to myself: “Probably, it’s the best solution to increase the quality of your operations with these services”. I still work with these services, but maintenance no longer takes a significant amount of time.