Infrastructure Engineer & Site Reliability Evangelist
A well-known expression says, “Hope for the best and prepare for the worst.” I kept repeating it to myself while traveling between countries and offices with my six-hour, 220-slide, all-theory-no-practice incident management training. It was a fantastic experience, and I got great feedback. But today, three years later, I think it was one of my most stupid ideas.
In this post of the Incident Management series, following the one on Onboarding, I’ll go over the lessons I’ve learned and answer the “why”, “when”, and “what” of incident management training, so you can make sure your team is ready for any kind of production incident in the most efficient and least frustrating way.
During my first four years of working on production incidents, I was utterly sure that the only thing that mattered was experience. I was (and still am) a fan of the excellent book “Outliers: The Story of Success” by Malcolm Gladwell, so I was just waiting to get my 10K hours of experience with incidents to finally become a real pro.
Somehow, all the people working with me on production incidents kept a veil of magic over the work we were doing. And I believed in that magic myself: the “it’s not written anywhere, but I know it”, “I had a feeling it would crash”, “I’m on it, don’t ask me anything, just watch” kind of magic.
Needless to say, when I met my great future manager and he asked me whether I had ever thought about teaching others to deal with production incidents, I laughed. How can you teach someone magic?
The one thing I learned over the next four years: incident management is a methodology (or a set of them) and has absolutely nothing to do with magic.
Why is training needed? Because as much as proper onboarding can give a great base, that base is still not enough for handling production incidents efficiently. By efficiently, I mean “mitigating the impact as fast as possible with the minimum resources possible”.
In my opinion, proper incident management training has to cover the following goals:
When should it happen? The closer to the first on-call shift, the better. I suggest holding the training once a new employee has been with the company for 2.5–3 months and 1.5–2 weeks before their first on-call shift. That gives a new team member enough time to learn about the company, complete onboarding, meet other employees, and feel more comfortable and less stressed.
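The timing rule above is simple date arithmetic, so it can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, 1.5 weeks is rounded down to 10 days, and 2.5 months is approximated as 75 days.

```python
from datetime import date, timedelta


def training_window(first_shift: date) -> tuple[date, date]:
    """Return the (earliest, latest) training dates before the first on-call shift.

    Assumption: the window runs from 2 weeks to roughly 1.5 weeks
    (rounded down to 10 days) before the shift.
    """
    earliest = first_shift - timedelta(weeks=2)
    latest = first_shift - timedelta(days=10)
    return earliest, latest


def has_enough_tenure(hire_date: date, training_date: date,
                      min_tenure_days: int = 75) -> bool:
    """Check the 2.5-month tenure rule, approximating a month as 30 days."""
    return (training_date - hire_date).days >= min_tenure_days


if __name__ == "__main__":
    earliest, latest = training_window(date(2024, 6, 17))
    print("Schedule training between", earliest, "and", latest)
    print("Tenure OK:", has_enough_tenure(date(2024, 3, 1), earliest))
```

If the tenure check fails for the earliest date in the window, it is probably a sign that the first on-call shift is scheduled too early rather than that the training should be skipped.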
I think it’s also essential to repeat such training once in a while for any team working on production systems. Incident management training is a good reminder of ownership, of the impact your work has on others (both customers and employees), and of the methodology your company follows.
What should it look like? I started this post with the story of the training I did: super long, super tedious, and super theoretical. And even that training worked. Of course, every one of us likes to do amazing projects of the best quality. Unfortunately, the real world makes its own adjustments. You should be ready for iterations and improvements, but the sooner you start running such training, the better.
I’ll describe several options for incident management training with pros and cons for each, so you can choose the one which fits you the best.
Option 1 — “The Boring”:
The Pareto principle works for production incidents too: you can do 20% of the work to cover 80% of possible problems in your training. Incidents repeat, and the techniques you can use to mitigate them are limited. It’s possible to prepare runbooks, procedures, and simple algorithms to follow that mitigate 80% of all production incidents.
So during the training itself, you just need to introduce all the documents to team members, describe the logic behind them, and show a fast and easy way to find any of these documents.
Ideally, put most of the content into schemas and diagrams: they are easier to follow and faster to absorb. Try to keep the training to no more than 1 hour and the presentation to no more than 30 slides.
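As one possible take on a “fast and easy way” to reach the right document, here is a minimal Python sketch of a runbook index. Everything in it is hypothetical: the symptom keywords, the file paths, and the `find_runbook` helper are placeholders for whatever your team actually maintains.

```python
# Hypothetical index: symptom keyword -> runbook location.
RUNBOOKS = {
    "disk full": "runbooks/disk-full.md",
    "high latency": "runbooks/high-latency.md",
    "failed deploy": "runbooks/rollback-deploy.md",
}

# Hypothetical catch-all document for the incidents the runbooks don't cover.
FALLBACK = "runbooks/escalation.md"


def find_runbook(alert_text: str) -> str:
    """Return the first runbook whose symptom keyword appears in the alert text."""
    text = alert_text.lower()
    for symptom, path in RUNBOOKS.items():
        if symptom in text:
            return path
    return FALLBACK


if __name__ == "__main__":
    print(find_runbook("ALERT: Disk full on db-01"))
    print(find_runbook("Something nobody has seen before"))
```

The fallback entry matters as much as the index itself: the runbooks cover the frequent cases, and for everything else the person on call should still land on a clear escalation procedure.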
Option 2 — “The Actionable”:
You can choose the 3–5 incidents that happen most often (or just take the most recent ones), simulate them in a dev environment, and handle them in a workshop format, so that everyone participating has to fix them too.
Option 3 — “The Best”:
This option is inspired by an advanced driving course I took, and it combines the theoretical and practical options. That course started with a short presentation containing a bit of theory and interactive questions, followed by a practical task with more theory along the way. The tasks were then repeated a few times, and the course ended with a small competition. I think this format is just great for incident management training.
I hope you’ve already chosen the option you like the most, but before you get to work on it, I’d like to share a few more important details to sum the topic up:
See you next week with a post about on-call schedules!