Hiring People With a Knack for Incident Management

How much time do you spend thinking about soft skills when hiring engineers? I’m guessing I’m not the only one who copy-pastes “good communication skills, fast learner, stress resistance” into every job description. I did it not because I was too lazy to think about it, but because I didn’t think it mattered. I always thought that technical skills were what counted for any tech position, and of course the same applied to anyone who handles production incidents. If a person has been working with Java for 15 years and some Python service crashes in production, it would probably take her longer to fix it than it would take someone with Python experience. And timing is critical during incident handling, so we would only look at CVs of people with relevant Python experience. This sounds logical, but it is actually incorrect. Soft skills matter as much as technical skills for incident management, or even more. And in my opinion, they’re far more difficult to define in a job description, to check during an interview and to find in candidates. This post will help you address each of these difficulties.

Two stories about the power of communication

What am I talking about and why? I’d like to share two stories to clarify.

A few years ago we were switching our DDoS protection from one external provider to another. The project took 7 to 8 months of work. Our top-level networking engineers went through domains, endpoints and routes one by one: setting rules, testing, calibrating and rechecking all over again. We were ready to celebrate the happy ending of an amazing project, but during the actual switch something went wrong. The switch happened during the night in our time zone. An on-call engineer from the networking team was woken up and almost immediately found out that the change was irreversible. So he started debugging what had broken and why the new system failed in some specific but really important cases. He spent 7.5 hours debugging before other team members woke up and one of them recalled a stupid hack that had been put in place to handle these cases. The same hack was added to the new system within the next 3 minutes. Of course, a lot of things could have been done better before and during the incident, but these 7.5 hours of partial downtime could’ve been an hour or less with better communication, that is, if the on-call engineer had involved his teammates earlier.

Conversely, one of my favorite “DevOps day-to-day life” stories is about soft skills helping in a very weird incident. It all started as a simple, fast-to-fix incident. An API key ID from one of the external services expired. It was used in a production service, so the service owners immediately notified the on-call engineer that they needed a new key ID, which he created within a few minutes. The service owners just had to replace the key ID in the code, and that should’ve solved the whole incident. But no. This key ID was used in a mobile application, so in order to have it replaced, every user of the app had to install an update. That already made us forget about the idea of a “fast-to-fix” incident. But we couldn’t wait for the update, so we started to think about ways to restore the old, expired key ID. We contacted the support team of the external service and asked for help. They answered that there was no way the key could have expired, because their IDs don’t have an expiration date. Most likely someone had simply deleted it, which meant that a) it was possible to restore the key ID; b) in order to restore it we needed to know under which project of this external service the key had been created. It could’ve been an easy task, but we had 350 projects with numerous keys in each and no log of key deletions, which basically meant that our key ID could’ve been in any of them. The on-call engineer decided to go with the simplest plan: call the developers of the app and ask where they got the key ID from. The answer was as simple as the plan: “I found it in our company GitHub”.

OK, so we could check GitHub too, contact the developers of the other services that were using the key and ask where they got it from. A GitHub search showed 7 projects using the same key ID (not counting the project we already knew about). The on-call engineer started calling the owners of these 7 services one by one. No one remembered where the ID had come from. By the time he got to the last one, we were already sure nothing would work. The last project was a legacy one, with its last commit dated 5 years back. The on-call engineer still decided to try calling the project owner. After a few minutes of unexpected questions to a very surprised developer, the developer remembered who had given him the key. It started to get easier: we just needed to call that person and ask under which project he had created a key a bit more than 5 years ago. Needless to say, he was as surprised by the on-call engineer’s questions as the developer from the previous call had been. But he remembered that back then, 5 years ago, there were just a few projects, and all of them were supposed to be closed already. After a few more minutes he remembered the names of the projects, and indeed, in one of them we found a key ID with the same name as the “expired” one but created that day. After a few more hours the support team of the external provider restored our previous key ID and the service started to work again. There was literally nothing in the handling of this incident that required any technical skills. But it still got handled.

I’m not trying to convince you to hire people without tech skills. I just think that technical skills are not the only skills needed. When you’re hiring someone who will be part of an on-call schedule and is expected to handle production incidents (not necessarily as their only job, but even once in a while, as any service owner would), ask yourself: do you feel comfortable with the fact that the fast recovery of your production relies on her communication skills? Are you sure that in a stressful environment (as incidents are for many people) this person will stay focused on production and not on fear?

Defining soft skills for people dealing with production incidents

Here’s a list of skills I think people working on production incidents should have (and the more of this work they do, the more this list becomes a requirement for them):

1) Openness. People working on incidents shouldn’t have a problem sharing information, admitting mistakes, getting more people involved in solving the problem or simply getting approval before taking any action.

2) Communication skills. These people can explain the problem in an easy-to-understand manner to anyone, whatever their level of understanding of the system or their level of tech knowledge. They can express themselves, listen to others and notice important details in a huge amount of text or talk (which is what discussing an incident usually involves).

3) Attention to detail. Usually, if we can predict an incident, we put preventive fixes or a safety net in place for it. This doesn’t happen with incidents we never even thought of, and that’s the most common type of incident. You need people who are able to notice any slight change in any part of your system, because this change can potentially be the cause of all of your problems.

4) Risk assessment. There will be times your people will have to choose the best option out of several. You want to make sure they know how to do it right.

5) Ability to say sorry. As I said before, production incidents are often a real stressor for the people handling them. People cope with stress differently, and not all of them are able to stay calm, smiley and friendly. Stress is a great breeding ground for conflict, blame and emotional outbursts. It’s extremely important to be able to say sorry afterward. Sometimes, even if you did everything right.

I think it’s important to not just write down soft skills in the job description or leave them for the HR team to check, but to actually test them during an interview, as with any other skill.

How to test soft skills during an interview?

I’d like to give an example of an exercise we used for combining checks of tech skills with soft skills. We called it “handling an imaginary production incident live”.

We told candidates that there was an issue in production. We would give them anything they asked to see, check or get, but they had to talk us through their thinking process and ask questions. Usually the story went as follows: your website returns a 500 error page. You have monitoring in place, and it has just alerted that your server-side application is not responding, the database is down and DNS is not working. Please go ahead and tell us what you’ll do with this information. As soon as the candidate started telling us which logs they would check, we would tell them what was in the log, leading the story to a root cause of all the issues and finishing by asking the candidate how to fix it.

What’s important in this exercise is:

1. We can check if the person opens up. If they share thoughts, raise ideas, weigh the pros and cons of these ideas and do it all out loud without being afraid to admit that some ideas are wrong, it’s a very good sign.

2. Communication skills. I usually like to pretend that I don’t really understand the terms the person uses, or that I’ve forgotten some concepts. For example, if they are discussing with another interviewer the possible ways DNS could fail, I’d ask what DNS is and then keep asking questions until we get to the most basic explanation. That’s what constantly happens during incidents that affect the whole company, so that’s what has to be checked in advance. Of course, it’s also a good sign if the person doesn’t forget about notifications and reporting and names them as steps of the incident handling.

3. Attention to detail, in my opinion, is best checked by actually handing over a part of a log file, because during real incidents most of the important details are read on screen and have to be noticed visually. Honestly, though, we never did that, as the whole task was verbal. Verbally, you can check it by mentioning many irrelevant details and hiding a small but important one among them. For example, if the candidate wants to check a log, you can start talking about the timestamps you see in the log and the format of the log, read out parts of a stack trace, state the actual error, continue with which lines of code were executed when the error occurred, and so on.

4. Risk assessment is the most basic thing to check with such a task. As you remember, the task started with a backend service, the DB and DNS all failing. The candidate has to decide what to check first, which brings risk assessment into play.
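
If you do want to try the log-based check from point 3, the excerpt can be prepared in advance. Here is a minimal Python sketch of one way to do it. Everything in it is made up for illustration: the log format, the messages, the start time and the function name build_excerpt are hypothetical and don’t come from any real system. It generates a short excerpt of routine lines and hides a single important line among them, so the candidate has to spot it.

```python
import random
from datetime import datetime, timedelta

# Hypothetical routine messages; none of these come from a real system.
NOISE = [
    "INFO  http  GET /health 200 3ms",
    "INFO  http  GET /api/items 200 41ms",
    "DEBUG cache hit key=items:page:1",
    "INFO  auth  token refreshed for session",
    "DEBUG pool  connection returned to pool",
]

# The one line the candidate is expected to notice (also hypothetical).
SIGNAL = "WARN  pool  connection pool exhausted, waiting for free connection"


def build_excerpt(total_lines=30, seed=None):
    """Return (lines, signal_index): a log excerpt with one hidden signal line."""
    if total_lines < 3:
        raise ValueError("need at least 3 lines to hide the signal in the middle")
    rng = random.Random(seed)
    # Keep the signal away from the first and last line, where it would stand out.
    signal_index = rng.randrange(1, total_lines - 1)
    start = datetime(2024, 1, 1, 3, 0, 0)  # arbitrary, made-up start time
    lines = []
    for i in range(total_lines):
        timestamp = (start + timedelta(seconds=2 * i)).strftime("%Y-%m-%d %H:%M:%S")
        message = SIGNAL if i == signal_index else rng.choice(NOISE)
        lines.append(f"{timestamp} {message}")
    return lines, signal_index


if __name__ == "__main__":
    excerpt, answer = build_excerpt(seed=7)
    print("\n".join(excerpt))
    # Interviewer's note only; do not show this to the candidate.
    print(f"\n[interviewer] signal is on line {answer + 1}")
```

Passing a fixed seed gives every candidate the same excerpt, which makes their answers easier to compare.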

It’s possible to have separate tasks for different skills, or to check soft skills while focusing on tech tasks only. Ask the candidate to explain a tech concept at different levels for people with different levels of tech knowledge, check how the person communicates with the interviewers when questions are being asked, and check the logic the person uses to select tasks or define priorities. Communication is important in any relationship, and this includes the ones at work.

How to make a decision?

Of course, as with any other skill, you shouldn’t say “no” to people who are lacking in one area just because they don’t exactly match the list of requirements. Every skill can be trained, replaced or “balanced” by other team members or a manager, but it’s very important to realize this fully and to understand all the relevant concerns. For example, if a person lacks communication skills, you can ask his team lead to join this employee’s shifts and take over the communication and reporting parts, keeping the communication flow as usual. The main downside is that you’ll always have 2 people on shift instead of one. If you’re OK with that, it’s totally fine to make the hire. Another option is to design special training to teach your future employees to communicate the way your company needs.

I’ll address this and more in the next 2 parts of this blog series: onboarding of employees dealing with production incidents, and incident training.
See you soon!