On-call teams at startups have three big problems: they’re small, they cover a wide breadth of infrastructure, and, as a consequence of the first two, they usually lack the bandwidth to write and maintain documentation for a suite of DevOps tools. At SigOpt, our on-call team tackles these challenges with a biannual “disaster recovery exercise”, or simulated outage.
In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team. I hope that by the end you’ll consider running a disaster recovery exercise for your on-call team!
A disaster recovery exercise is a fire drill for your on-call team. The exercise is most useful when it is as realistic as possible. A well-designed exercise will have engineers searching through your production codebase for the tools they need to operate on a production-like environment.
Our disaster recovery exercises follow four basic principles:
At SigOpt, we run on AWS, so our first exercise was to spin up an API from scratch in our backup region. Our sterilized environment was us-east-1, with no access to AMIs, instances, or databases in our production region. Our objective was to hit dr-api.sigopt.com and serve an API request. Our timebox was 4 hours, which we chose from an engineering OKR.
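An objective like "the endpoint answers a request before the timebox runs out" can be checked automatically while the team works. Here is a minimal sketch of such a check, not the tooling we used. The function names, the polling interval, and the rule that any 2xx response counts as success are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def endpoint_is_up(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` comes back with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts, and HTTP 4xx/5xx
        # responses all count as "not up yet".
        return False


def wait_for_objective(
    probe: Callable[[], bool],
    timebox_s: float,
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it succeeds or the timebox expires.

    Returns the seconds elapsed when the probe first succeeded,
    or None if the timebox ran out first.
    """
    start = clock()
    deadline = start + timebox_s
    while clock() < deadline:
        if probe():
            return clock() - start
        sleep(interval_s)
    return None
```

For a 4-hour timebox, the call would look like `wait_for_objective(lambda: endpoint_is_up("https://dr-api.sigopt.com"), timebox_s=4 * 60 * 60)`, where the `https://` scheme and the bare root path are guesses. The `clock` and `sleep` parameters exist only so the loop can be tested without waiting in real time.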
Tip: Create new AWS keys for each exercise to avoid accidentally deleting production resources (and temporarily deactivate current keys to ensure the new ones are used!)
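One way to script that key swap is sketched below, under some loud assumptions. It assumes a dedicated IAM user for the exercise (hypothetical, and assumed to have permissions scoped to the sterilized environment), and it assumes `iam` is a boto3 IAM client, created with `boto3.client("iam")`, that is authenticated with credentials other than the keys being deactivated. The function names are made up for this post.

```python
from typing import Any, Dict, List, Tuple


def begin_exercise(
    iam: Any, exercise_user: str, everyday_user: str
) -> Tuple[Dict[str, Any], List[str]]:
    """Issue a fresh key for the exercise user and pause the everyday keys.

    Returns the new key (including its secret) and the IDs of the keys
    that were deactivated, so they can be restored afterwards.
    """
    # IAM allows only two access keys per user by default, so this call
    # fails if the exercise user already has two.
    new_key = iam.create_access_key(UserName=exercise_user)["AccessKey"]

    listed = iam.list_access_keys(UserName=everyday_user)["AccessKeyMetadata"]
    paused = [k["AccessKeyId"] for k in listed if k["Status"] == "Active"]
    for key_id in paused:
        iam.update_access_key(
            UserName=everyday_user, AccessKeyId=key_id, Status="Inactive"
        )
    return new_key, paused


def end_exercise(
    iam: Any,
    exercise_user: str,
    everyday_user: str,
    new_key_id: str,
    paused: List[str],
) -> None:
    """Reactivate the everyday keys and delete the exercise key."""
    for key_id in paused:
        iam.update_access_key(
            UserName=everyday_user, AccessKeyId=key_id, Status="Active"
        )
    iam.delete_access_key(UserName=exercise_user, AccessKeyId=new_key_id)
```

Only keys that were active before the exercise are reactivated afterwards, so a key that was already disabled stays disabled.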
We ran our original disaster recovery exercise to diagnose holes in our ability to recover our infrastructure. True to our goal, the exercise produced a few months of projects to work on.
We found many larger projects but, funnily enough, the quickest fixes were usually for the least obvious bugs. For example:
To find problems large and small, we run a debrief meeting to conclude the exercise. In this meeting, we talk candidly about what worked and what didn’t, referring to notes taken during the exercise.
Here are some of the questions that we ask ourselves during our debriefs:
At SigOpt, we are constantly trying to learn. Though it started as a way to diagnose infrastructure, the disaster recovery exercise quickly proved to be a fantastic trial-by-fire learning opportunity for our small team, and engineers reported increased self-confidence in their on-call problem-solving abilities.
We use the following principles to set up the team dynamics for the exercise:
Additionally, because all team members are together in one room, working on one problem, the disaster recovery exercise is a unique team-building exercise. To extend the team-building atmosphere after the recap meeting, we’ve started to include dinner and drinks as an offsite!
A disaster recovery exercise is many things for us. It’s a fire drill that proves to newer team members they have what it takes to be a part of our on-call rotation. It’s a diagnostic to identify bad, broken, or hard-to-find tools. And it’s a team bonding exercise where everyone sits down together for a few hours to solve a challenge. Next time you’re planning a team event for your on-call team, I hope you’ll consider your own disaster recovery exercise! If you do, I’d love to hear about it.