My first incident at incident.io happened in my second week here, when I screwed up the process for requesting extra Slack permissions, which made it impossible to install our app for a few minutes. This was a bit embarrassing, but also simple for someone more familiar with the process to resolve, and declaring an incident meant we got there in just a few minutes.
Declaring your first incident at a new job can be intimidating, but it really shouldn't be. Let's look at some common fears, and work out how to address them.
Most organizations have some kind of incident response procedure that includes a list of things you have to do, like working out how many customers were affected, deciding how to make things right with them, and putting in place proactive measures to avoid it happening again.
That can seem like a daunting prospect. You might ask: "Is this issue really worth all that?"
A safe default should be "yes". If it turns out this issue wasn't so bad after all, you should be able to shut down the response pretty quickly. But if it escalates further, you'll be glad you already have the process rolling.
You can find out more about how to automate your incident process in our previous article.
If it does end up being a lot of work, that isn't necessarily a bad thing. As an individual, you're likely to learn a lot from tackling your first big incident. As a team or company, you've addressed a serious issue proactively.
Another common fear: will you get blamed for declaring (or causing) an incident? If the answer is "yes", you should probably be looking for another job.
As Chris discussed before, having more incidents is not a bad thing: it's not in the long-term interests of any organization to brush small incidents under the rug, because you can never tell which might turn into huge problems later on.
The same logic applies here too! Most teams that don't have any incidents are either not taking any risks and slowing down delivery, or hiding their problems. Neither of those is a sign of a healthy team.
That's not to say that managers should set a target number of incidents per team per quarter, but it does mean that managers should be looking at outlier teams that have very few incidents, as well as those that have more than their fair share.
Are they afraid of getting blamed? Are they spending so long making sure everything they deliver is perfectly robust that they forget about their customers? Did they struggle to respond to an earlier incident effectively and need extra help learning how it's done well?
There will always be a first time, and it's probably better if the first incident you run isn't a critical "everything is down" one. A major incident is stressful enough without having to learn about your organization's response processes at the same time.
Game days, where you run a pretend incident in a non-production environment, are great for learning your incident response process (and honing your debugging skills!), but you have to apply that knowledge to something real sooner or later.
Even with blameless post-mortems and a well-run incident response process, I've seen teams decide that downtime on a key product wasn't an incident because it wasn't down for that long and probably no one noticed.
When you join a team, especially as a more experienced engineer, part of the value you bring is your experience of the different cultures you've worked in, including how they respond when things go wrong. This might feel uncomfortable (shifting the culture of a team always does), but the payoff is absolutely worth it.