When building an incident response process, it's easy to get overwhelmed by all the moving parts. Less is more: focus first on building solid foundations that you can develop over time.
Here are three things we think form a key part of a strong process.
I'd recommend taking these one at a time as you introduce incident response throughout your organisation.
Just being transparent: we're a startup providing incident management software, incident.io. We believe that using our software helps build a good incident process; otherwise, we wouldn't be building it. But beyond that, we've also got a lot of experience building and participating in incident processes, and we think this advice is useful regardless of whether you choose our product.
This is, in the words of Julie Andrews, a very good place to start.
Getting a common understanding of what an incident is (and isn't) is the first step in bringing people into your incident response process.
An incident is any situation where something unexpected happens that has (or might have) a negative consequence.
Particularly when you're getting started, the best way to embed a process into your organisation is to use it. A lot. This also helps everyone learn the process and get better at incident response overall, so that when something really bad happens, the response feels like a well-oiled machine.
To this end, you want a broad and inclusive definition of an incident: if something unexpected is happening and it could have a negative impact, it's worth treating as an incident.
Transparency by default is a really important value to bake into your incident process.
First up: make sure it's clear who is responsible for communicating. Whether that's the incident lead or another chosen individual, making someone explicitly accountable is the best way to keep the updates coming.
Make it really easy to tell stakeholders what's going on, and use the tool that makes the information easiest to consume (whether that's email, Slack or something else entirely).
Use a predictable format for the updates, as this makes them easier to parse and scan for a busy stakeholder flying through their notifications. For example, starting every update with the current status, the customer impact and when to expect the next update makes the key facts easy to pick out.
These updates also advertise your incident process and normalise the fact that incidents happen. Ideally, someone's first interaction with the incident process at your org should be as a consumer of an update, not being parachuted into the middle of something.
Get comfortable admitting that things go wrong, both internally and (to some extent) externally with customers. This builds trust and enables people to adapt their behaviour to mitigate the impact on their side (e.g. if a customer knows you are having an outage, they'll move to another task and come back tomorrow instead of furiously refreshing the page).
Your incident process shouldn't end once the problem is resolved. To get value from your incidents, you want to be using them to learn and improve your day-to-day operations. There are often follow-up actions that need to be put on someone's backlog or wider problems that should be considered and prioritised.
Post-mortem documents and incident reviews are a great way to extract learnings from an incident. They help encourage reflection and often bring up related concerns about the way the team is working.
However, there's a Goldilocks zone: if the post-incident process feels too painful, people will stop declaring incidents at all.
Clearly communicate the value of your post-incident process, making sure it doesn't feel like a box-ticking exercise. Don't make it harder than it needs to be: cut out repetitive manual work, for example by using a tool or a template.
Give people autonomy to decide what's appropriate for the incident they've experienced. Sometimes a simple update in the incident channel explaining what happened and why is completely sufficient. Sometimes you'll want to run a full-blown cross-team workshop to understand what happened and why the response didn't go as smoothly as you'd hoped.
When you're first building an incident response process, focus on a few key things: a clear and inclusive definition of what counts as an incident, transparent communication by default, and a lightweight way to learn from every incident.
Once you've got these nailed, you can start layering more stuff on top. But don't try to run before you can walk.