Who’s to Blame? (v0.1)

Written by crohacz_86666 | Published 2017/10/09
Tech Story Tags: software-development | devops | systems-thinking | agile | learning


A few weeks ago, I took down production. I ran a script that was meant to converge our sandbox environments and accidentally converged our production environment instead. Production went down and stopped serving customers. At that moment, the client I was consulting for happened to be monitoring the production traffic of our platform, noticed the outage, and reached out to my team: “Hey, what’s happening?”

My heart fluttered, I got nervous, and my partner stood over me exclaiming, “What did you do?!” I scrambled to compose my thoughts, converged production back, and just like that, it was over. I had made the mistake I had been dreading since I started this massive project.

Now your initial instinct might be to say, “I do not want that person on my team.” You might be thinking I was careless, that I’m young, that I wasn’t thinking about what I was doing. It was the result of human error, and perhaps the best solution would be to remove the human (i.e. me) from future situations, or to instill some level of fear in me to teach me a lesson. These thoughts might not drift too far from what you thought if you read about the AWS S3 outage. The public release reported that “an authorized S3 team member using an established playbook executed a command… Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” Although that S3 outage had far bigger implications than mine did, the mistake was similar in nature. In fact, when I heard about the S3 outage that brought down much of the internet for the better part of a day, I honestly assumed that individual got fired.

What I didn’t tell you about my incident is that for the entire week leading up to it, we had worked over 60 hours putting an urgent feature into production for our client. We were in a position that forced us to skip our traditional automation process; instead we did most of it manually. The best we could scrape together was a handful of bash and Python scripts that we couldn’t even run from our own machines; we had to hop onto other machines because of AWS security groups, SSL cert issues, and other restrictions we planned to fix later but didn’t have time for at the moment. We were on a severe deadline to get this feature into production, and we had reached the end of the line. Now what are you thinking? Surely I could have handled a stressful situation like that better, but there are many individuals who could improve on that quality. It is widely known that stress impairs our cognitive abilities(1), blurring our decision-making skills and clouding our memory. But did the context I just presented about the incident I triggered change anything about the story and the conclusions you might draw from it?

The reality is, pointing fingers is easy to do, and it’s gratifying.

I work in the area of system reliability, and incidents happen a lot. When an incident happens in software, especially in complex teams and systems, it can be difficult and overwhelming to figure out what went wrong. But it’s always easy to ask “who touched the code last?”, “when was something last deployed to production?”, or “what button did you push?” Thanks to version control and user tracking, we can now answer these questions far faster than we can diagnose a really complex problem. So we do: we find a name, we point a finger, and we say it was someone else’s responsibility. It absolves us as individuals from the guilt, embarrassment, or work that’s required to fix the issue. Putting the blame on someone else is a natural response that helps us avoid situations we don’t want to be held accountable for. But what does this do for our organizations and our teams?
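To see just how cheap the “who touched it last” question has become, a couple of everyday git commands answer it in seconds. The file path and line range below are purely illustrative, not from the project in this story:

    # Who last changed this file, and when? (hypothetical path)
    git log -1 --format='%an changed this on %ad (%h)' -- services/converge/run.py

    # Who last touched these specific lines?
    git blame -L 40,60 services/converge/run.py

That a name is easy to find is exactly why stopping at the name is so tempting.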

Human error will never be the one and only cause.

Human error is just a label for what we believe the individual should have done but didn’t. When we break down an incident into what should have occurred and didn’t, we often leave out details that could help us reason about it. The idea of looking into the environment and circumstances that led an individual to make a decision is what David Woods, in his co-authored book “Behind Human Error”, calls seeking a second story. Without going into too much detail, seeking a second story offers the following value:

  • Saying what people should have done does not answer the question of why it made sense at the time for them to do what they did. A “blame and shame” culture therefore leaves us with no more information than we had before the incident.
  • Human error is better seen as the effect of vulnerabilities deeper inside the organization, and identifying those vulnerabilities leaves us better prepared for tomorrow.

When we stop blaming engineers, we improve our technology.

Etsy is by far the finest example of this. As perhaps the best-known demonstration of a just, blameless culture, Etsy has paved the way in supporting engineers who make mistakes so that those engineers can, in turn, support the rest of the organization. Their goal is to dig deeper into incidents to understand the situational circumstances that led to them, or what about the situation led an engineer to do what they believed was appropriate at the time. They view mistakes as a source of learning.

“A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future.”(2)

Humans will make mistakes. Allow it, encourage it, and embrace it.

Automation won’t completely save you (although it can help avoid major incidents). Neither will removing individuals from their positions. These are common responses to large incidents. But even with automation, people can press the wrong button at the wrong time. And if you remove people from teams, it’s likely that someone else will make the same mistake, or a different one. It will happen, and you want to be there to support it when it does, because teams learn and grow from it.

Now, there’s a fine line to walk between carelessly clicking buttons and exercising appropriate caution when you run a script against something like a production environment. I’m not justifying what I did when I took down production; I should have exercised greater care, but automation would have helped us, even something as simple as a guard that refuses to touch production without explicit confirmation (sketched below). Wherever that line is drawn, it should be understood by the entire organization.
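To make that concrete, here is a minimal, hypothetical sketch of such a guard, assuming a simple wrapper script that takes the target environment as its first argument. The environment names and the converge_env placeholder are illustrative, not the actual scripts from this story:

    #!/usr/bin/env bash
    # Hypothetical convergence wrapper with a production guard.
    # converge_env is a placeholder for the real logic (Chef, Ansible, custom, ...).
    set -euo pipefail

    converge_env() {
        echo "Converging $1..."
    }

    TARGET="${1:-}"
    if [[ -z "$TARGET" ]]; then
        echo "No target environment given; refusing to guess." >&2
        exit 1
    fi

    if [[ "$TARGET" == "production" || "$TARGET" == "prod" ]]; then
        read -r -p "You are about to converge '$TARGET'. Type the environment name to confirm: " ANSWER
        if [[ "$ANSWER" != "$TARGET" ]]; then
            echo "Confirmation failed; not converging $TARGET." >&2
            exit 1
        fi
    fi

    converge_env "$TARGET"

A check this small would not have fixed the exhausting week that led up to the incident, but it turns a one-character slip into a prompt someone has to consciously acknowledge.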

In the end, we can only understand failure when we examine both its sources and our reactions to it.

What’s to come in v0.2?

When I get more time, I’ll publish a post on methodologies and practices folks can use to promote a “just” culture. We all know it’s easier said than done.

  1. https://en.wikipedia.org/wiki/Effects_of_stress_on_memory#cite_note-Nature-1
  2. https://codeascraft.com/2012/05/22/blameless-postmortems/
