Post Mortems are for learning and sharing

An introduction to how we do Post Mortems at Skyscanner Engineering

Post Mortems are for learning and sharing. Illustration by Skyscanner’s Gavin Spence.

By John Paris

This is the first post in our Post Mortems series, which we hope to continue over the coming weeks. But first, let’s go into a little more detail on what Post Mortems mean to us at Skyscanner Engineering.

At Skyscanner we know that our technology or our processes will let us down at some point, and that’s OK.

What’s important for us is that we take the opportunity to learn, and then share that experience across our organisation so that others don’t run into the same issue. This is especially true as our technology platforms scale up to meet the demand of travellers across the globe. We want to be sure that the teams who are riding the front of that wave on their surfboards share their experience with others who are still on their bodyboards. We also want to make sure that Skyscanner is there to meet the expectations of travellers and our partners around the globe and in all time zones.

Following an outage, our teams will conduct a ‘Post Mortem’ (aka Post Incident Review). In these reviews we lift the lid on the events and circumstances that led up to the failure, looking for ways to prevent recurrence and to improve detection and mitigation.

How these reviews are carried out is essential to their success. If you really want to learn, then you have to leave any questions of blame at the door. This is not a new concept, and there is plenty of reading available on Blameless Postmortems from specialists in academia and across many industries.

“…leave any questions of blame at the door.” Illustration by Skyscanner’s Gavin Spence.

We also have to focus on the detail and be obsessive if we are to get to the root cause(s). The simplest process we use here is the 5 Whys, which even in a complex environment fits the majority of incidents. However, it has its limits. If you’re interested in some further reading from technology people on these matters, then you can’t go wrong starting with:

John Allspaw from his time at Etsy
Mathias Lafeldt and his experiences that led from mistakenly deleting an instance
This from our good friend Jason Hand at VictorOps

This is the introduction to an ongoing series. For now, onto the first Post Mortem:

🐞 That time a dormant bug came alive and everything went wrong at the same time by Dave Archer and Matt Hailey

Like what you hear? Work with us

We do things differently at Skyscanner and we’re on the lookout for more Engineering Tribe Members across our global offices. Take a look at our Skyscanner Jobs for more vacancies.

About the Author

Hi, I’m John and I’m a Technical Manager working for Skyscanner in our Edinburgh office. I’m the team leader for a squad who are responsible for ensuring Skyscanner is available for travellers around the globe when they want to use us. In our role we provide a last line of defence against service failures and enable engineering teams to become stronger at DevOps in the spirit of “you build it, you run it”.

In my spare time you’ll find me on a bike, running up hills and walking up mountains, especially in the Scottish winter.

Want to find out more about what it’s like to work at Skyscanner or more about what I do here? Email me at john(dot)paris(at)skyscanner(dot)net.

John Paris, Skyscanner