How do you document infrastructure failures and outages?
At the very beginning of my career, I familiarised myself with the post-mortem practice. After each large outage (fortunately, they were really rare), we created a document describing the nature of the outage and how it first occurred, our course of action, and the outcome. Post-mortems were finalized with a follow-up section listing plans and Jira tickets to prevent similar events in the future.
You can familiarise yourself with a typical structure of a DevOps post-mortem in this article, as well as with a great example here.
Our structure was similar:
I really like this approach and advocate for this practice as a way to avoid recurring issues. The nature of DevOps tasks speaks for itself: the task density is so high that it's hard to remember every workflow. Here, post-mortems are useful for keeping vital information available, sharing the experience gained with the rest of the team, and onboarding newcomers.
However, when I acquired my own areas of responsibility and started to work independently, I soon realized that my scope of work included far more issues to fix than I had time to document in post-mortems.
My tasks involved dealing with dozens of multilayer services, each of them quite volatile and sensitive to multiple infrastructure factors: network speed, free disk space, inter-service connectivity, and so on. Moreover, they were based on open-source software and required constant updates to remain functional, and an update often only brought new issues. All in all, it sounds like an administrator's nightmare, doesn't it?
In fact, it was not. Still, I soon realized that I needed to document how I fixed small issues of various kinds, so I returned to the post-mortem workflow, wondered whether it could help, and adapted it.
First of all, I decided to use one unified document (namely, a Confluence page) to store descriptions of all fixed issues. Since my work sometimes reminded me of the Adeptus Mechanicus from Warhammer 40k, I named it "Prayer Book: How we fixed minor issues" and presented the document officially as a Repair Log.
So, what was inside?
I stuck to our post-mortem plan but excluded some redundant steps, namely the first and last bullets of the plan above. The resulting structure was easy to read, and an entry for each issue could be filled in within 5-10 minutes.
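Just to give a feel for the format (the field names and the sample issue below are an invented illustration, not our exact template), a single Repair Log entry could look roughly like this:

```
Issue:     <service> fails to restart after a routine update
Date/who:  <date>, <who fixed it>
Symptoms:  the service crashes on startup; the free disk space alert fires
Actions:   freed up space on the data volume, restarted the service
Outcome:   fixed; the root cause was a full disk
```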
On top of this structure, I set two rules for filling in the log:
The latter rule may surprise you, but the reasoning behind it is simple: sometimes you try to fix an issue in several ways at once, and a lack of time to verify each attempt means you cannot be sure what exactly solved the problem. I decided to leave such entries for future amendments.
So I created a small, fast-growing version of post-mortems for internal use.
The Repair Log was appreciated by the rest of my team and started to grow. Together with the post-mortems, this document became a vital part of our onboarding process. It is easy to read and expand, but most importantly, it remains my "second memory" whenever I face an issue that looks familiar.
Within a few months, I even started to notice that the number of issues was decreasing, and nearly all of them had made it into the log. Then I said to myself: "This is probably the best way to improve the quality of operations with these services." I still work with these services, but their maintenance no longer takes a significant amount of time.