Incident Management Process: How to Train For The Tech-Fu
Infrastructure Engineer & Site Reliability Evangelist
A well-known expression says, “Hope for the best and prepare for the worst.” I kept repeating it to myself while traveling between countries and offices with my six-hour, 220-slide, all-theory-no-practice incident management training. It was a fantastic experience, and I got great feedback. But today, three years later, I think it was one of my worst ideas.
In this post of the Incident Management series, following the one about onboarding, I’ll go over the lessons I’ve learned and answer the “why”, “when”, and “what” of incident management training, so you can make sure your team is ready for any kind of production incident in the most efficient and least frustrating way.
How did I come up with the idea of incident management training?
For my first four years of working on production incidents, I was utterly sure that the only thing that mattered was experience. I was (and still am) a fan of the excellent book “Outliers: The Story of Success” by Malcolm Gladwell, so I was just waiting to get my 10K hours of experience with incidents to finally become a real pro.
Somehow, all the people working with me on production incidents were keeping the veil of magic over the work we were doing. And I believed that magic myself: “it’s not written anywhere, but I know it”, “I had a feeling it would crash”, “I’m on it, don’t ask me anything — just watch” kind of magic.
Needless to say, when I met my great future manager and he asked me whether I had ever thought about teaching others to deal with production incidents, I laughed. How can you teach someone magic?
If I learned one thing during the next four years, it is that incident management is a methodology (or a set of them) and has absolutely nothing to do with magic.
So, why should you have incident management training?
Because, as much as proper onboarding gives a great base, that base is still not enough for handling production incidents efficiently. By efficiently, I mean “mitigating the impact as fast as possible with the minimum resources possible”.
In my opinion, proper incident management training has to cover the following goals:
- explain what production incidents and on-call shifts are
- teach a team member your methodology: what an on-call engineer is expected to do, what the escalation policies are, what the notification flow is, and how to get help
- make sure a team member has a solid knowledge base or knows where to find information in the fastest way
- make sure a team member has all the required permissions for an on-call shift
- decrease the stress of the first on-call shift as much as possible
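To make the “escalation policies” and “notification flow” goal more concrete, here is a minimal sketch of how a policy could be written down as data that a new team member can read in a minute. Everything in it is an assumption for illustration: the role names, the timings, and the `who_to_notify` helper are hypothetical, not a recommendation or any specific tool’s API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    notify: str         # hypothetical role name


# Hypothetical policy: roles and timings are placeholders only.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "team lead"),
]


def who_to_notify(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Return everyone who should have been notified by now, in order."""
    return [step.notify for step in policy
            if step.after_minutes <= minutes_unacknowledged]


print(who_to_notify(0))   # ['primary on-call']
print(who_to_notify(20))  # ['primary on-call', 'secondary on-call']
```

Walking through a table like this during training answers “who gets paged, and when do I ask for help?” before the first real alert does.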
When to have incident management training?
The closer to the first on-call shift, the better. I suggest holding the training after a new employee has been with the company for 2.5–3 months, and 1.5–2 weeks before their first on-call shift. That gives a new team member enough time to learn about the company, complete onboarding, meet other employees, and feel more comfortable and less stressed.
I think it’s also essential to repeat such training once in a while for any team working on production systems. Incident management training is a good reminder of ownership, of the impact your work has on others (both customers and employees), and of the methodology your company follows.
How to build incident management training?
I started this post with the story of the training I did: super long, super tedious, and super theoretical. And even that training worked. Of course, we all like to deliver amazing projects of the best quality. Unfortunately, the real world makes its own adjustments. Be ready for iterations and improvements, but the sooner you start running such training, the better.
I’ll describe several options for incident management training, with pros and cons for each, so you can choose the one that fits you best.
Option 1 — “The Boring”:
The Pareto principle works for production incidents too: you can do 20% of the work to cover 80% of possible problems in your training. Incidents repeat, and the techniques you can use to mitigate them are limited. It’s possible to prepare runbooks, procedures, and simple algorithms that mitigate 80% of all production incidents.
So during the training itself, you just introduce these documents to team members, explain the logic behind them, and show a fast and easy way to find any of them.
Ideally, put most of the content in diagrams: they are easier to follow and faster to take in. Try to keep the training to no more than 1 hour and 30 slides.
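As a rough sketch of the Pareto idea, the snippet below counts past incidents by category and returns the smallest set of top categories that together cover a target share (80% by default). The function name `categories_covering` and the sample history are made up for illustration; your own categories and numbers will differ, and the split will rarely be a neat 20/80.

```python
from collections import Counter


def categories_covering(incident_categories, target=0.8):
    """Return the most frequent categories whose combined share reaches `target`."""
    counts = Counter(incident_categories)
    total = sum(counts.values())
    covered = 0
    selected = []
    for category, count in counts.most_common():
        if covered / total >= target:
            break
        selected.append(category)
        covered += count
    return selected


# Made-up sample data, for illustration only.
history = (["disk full"] * 9 + ["bad deploy"] * 7
           + ["expired certificate"] * 2 + ["dns"] + ["queue backlog"])
print(categories_covering(history))  # ['disk full', 'bad deploy']
```

In this invented example, two categories out of five account for 16 of 20 incidents, so runbooks for those two would be the first documents to prepare.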
Pros:
- shorter preparation time
- easier to switch presenters
- the statistics you gather to prepare the training will help you build KPIs and monitor improvements
Cons:
- you have to collect a lot of previous cases and analyse them to get statistics and data for the docs
- new docs can be seen as “bureaucracy” and cause resistance from current employees
- it doesn’t cover the goals of decreasing stress and checking permissions, as it’s entirely theoretical
Option 2 — “The Actionable”:
You can choose the 3–5 incidents that happen most often (or just take the most recent ones), simulate them in a dev environment, and handle them in a workshop format, so that everyone participating has to fix them too.
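One possible way to shortlist workshop scenarios from an incident log is sketched below: repeat offenders come first, and any remaining slots are filled with the most recent incident types. The `pick_drill_scenarios` helper, its thresholds, and the sample log are hypothetical and only illustrate the selection rule described above.

```python
from collections import Counter


def pick_drill_scenarios(incidents, limit=5, min_occurrences=2):
    """Pick up to `limit` incident types: repeat offenders first, then most recent.

    `incidents` is a list of (iso_date, incident_type) tuples.
    """
    counts = Counter(kind for _, kind in incidents)
    frequent = [kind for kind, n in counts.most_common() if n >= min_occurrences]
    picked = frequent[:limit]
    # Fill any remaining slots with the most recent types not already picked.
    for _, kind in sorted(incidents, reverse=True):
        if len(picked) >= limit:
            break
        if kind not in picked:
            picked.append(kind)
    return picked


# Made-up incident log, for illustration only.
log = [
    ("2023-01-10", "database failover"),
    ("2023-02-02", "bad deploy"),
    ("2023-02-20", "database failover"),
    ("2023-03-05", "expired certificate"),
    ("2023-03-18", "bad deploy"),
    ("2023-03-30", "bad deploy"),
    ("2023-04-01", "cache stampede"),
]
print(pick_drill_scenarios(log, limit=3))
# ['bad deploy', 'database failover', 'cache stampede']
```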
Pros:
- much more fun than a theoretical type of training
- can cover all the goals of the training
- more interactive: most probably, a lot of questions will be raised and more things discussed
Cons:
- depending on your environment, it can be very time-consuming to set up and prepare
- team members will most probably learn and remember specific cases rather than a methodology
- there will be no reference to return to for a reminder of the flow during a real incident in the future
Option 3 — “The Best”:
This option is inspired by an advanced driving course I took: it’s a combination of the theoretical and practical options. That course started with a short presentation with a bit of theory and interactive questions, followed by a practical task with more theory woven in; then the tasks were repeated a few times, and it all ended with a small competition. I think such a format is just great for incident management training.
Pros:
- it’s fun and super informative
- it shares the methodology and the logic behind the actions while actually making you remember the actions
- it’s a full-day training, so you can see how it feels to work on incidents both when you’re full of energy and when you’re already tired, which is the most realistic view of handling incidents
Cons:
- very time-consuming to prepare
I hope you’ve already chosen the option you like the most, but before you get to work on it, I’d like to share a few more important details to sum the topic up:
- Always try to make your training realistic. Use real data and real statistics, build a dev environment very similar to production, interrupt, ask questions, make sure there are notifications, etc.
- Talk it through. People make mistakes and will keep making them, and that’s totally fine. Teach your team to focus on the important things and make sure they know they can fix anything.
- Make your team members feel that you and the team have their backs. No matter what happens or how bad it looks in the moment of the incident, you can handle anything together.
See you next week with a post about on-call schedules!