As a business grows from a startup into a large enterprise, its software systems grow in complexity, and incidents become inevitable. These incidents carry indirect costs, which can include a loss of trust in the product or brand. However, an incident is not necessarily a bad thing: it gives the business new learning opportunities and the chance to improve its operational practices. But how do we learn from such a failure?
An incident postmortem is a meeting that brings together everyone who was involved, whether directly or indirectly, to discuss the incident, document it and derive value from it. Postmortems should remain positive to avoid demotivating the technical teams further. They should also be blameless: this empowers engineers to give details of their contribution and prevents a culture of fear from taking hold. The focus should be on learning rather than on dwelling on what went wrong, so any discussion of past events should stick to facts instead of opinions. Avoiding phrases that include "would have", "could have" or "should have" is essential.
A variety of stakeholders should be invited to the postmortem, including:
It may also be appropriate to invite a user who was directly affected by the incident. Inviting a range of people like this maintains transparency and allows us to gather as much information as possible to document.
Thorough documentation of the event is vital. Memories are short, and in time important details fade into obscurity. Some key facts to document are:
In addition to recording the above details in a postmortem document, the meeting should have minutes taken and a timeline of the incident constructed.
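As a minimal sketch of what a constructed timeline could look like in code, the Python below sorts timestamped events into order and measures the gap between two of them. The `TimelineEvent` type, its field names and the event labels (`started`, `logged`, `resolved`) are assumptions made for illustration; they do not come from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the incident timeline (hypothetical structure)."""
    at: datetime   # when the event happened
    kind: str      # illustrative labels: "started", "logged", "resolved"
    note: str      # factual description, no opinions


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda event: event.at)


def elapsed_between(timeline, first_kind, second_kind) -> timedelta:
    """Time between the first event of each kind in the timeline."""
    first = next(e for e in timeline if e.kind == first_kind)
    second = next(e for e in timeline if e.kind == second_kind)
    return second.at - first.at


if __name__ == "__main__":
    # Events are often collected out of order during the meeting.
    events = [
        TimelineEvent(datetime(2024, 1, 10, 9, 25), "logged", "Support ticket raised"),
        TimelineEvent(datetime(2024, 1, 10, 9, 0), "started", "Error rate begins to climb"),
        TimelineEvent(datetime(2024, 1, 10, 10, 15), "resolved", "Rollback completed"),
    ]
    timeline = build_timeline(events)
    for event in timeline:
        print(event.at.isoformat(), event.kind, event.note)
    print("Time before logging:", elapsed_between(timeline, "started", "logged"))
```

Keeping each note factual mirrors the blameless approach: the timeline records what happened and when, not who should have acted differently.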
The postmortem produces two things: discussion points to carry into further sessions, and documentation that will prove valuable in the future.
The recorded values should be compared against any service level agreements (SLAs) that are in place, to check whether the incident resulted in a breach. Any issues identified as a result of the incident should be discussed in depth, with potential solutions or mitigations planned into the roadmap alongside firm delivery dates. Each solution or mitigation should have a ticket written to capture the work, and each ticket should be SMART (commonly expanded as specific, measurable, achievable, relevant and time-bound). Depending on the incident, and particularly on who first logged it and how long it had been ongoing before being logged, improvements to observability may be required. These improvements should be prioritized alongside immediate fixes for the faults.
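The SLA comparison can be illustrated with a short Python sketch. It assumes the SLA is expressed as an availability target (for example 99.9%) over a fixed measurement period, which is only one common form; real agreements may define downtime and measurement windows differently. The function names `allowed_downtime` and `breaches_sla` are hypothetical.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Downtime budget implied by an availability target over a period.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    """
    return period * (1.0 - availability_target)


def breaches_sla(incident_durations, availability_target: float,
                 period: timedelta) -> bool:
    """True if the combined downtime exceeds the budget for the period."""
    total_downtime = sum(incident_durations, timedelta())
    return total_downtime > allowed_downtime(availability_target, period)


if __name__ == "__main__":
    month = timedelta(days=30)
    budget = allowed_downtime(0.999, month)
    print("Downtime budget:", budget)
    # A single 50-minute outage measured against a 99.9% monthly target.
    print("Breach:", breaches_sla([timedelta(minutes=50)], 0.999, month))
```

Passing every incident in the measurement period, rather than only the latest one, matters here: several short incidents can exhaust the budget together even when none of them would on its own.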
If an external user reported the issue, it may be appropriate to publish the postmortem's findings openly so that anyone can read them. Monzo is a notable example of a business that publishes its postmortem outcomes. Doing so enables it to maintain transparency, ensures accountability as a business and has given its users greater trust in the brand.
Previously published at https://kylejones.io/how-to-perform-a-successful-incident-postmortem