Let’s not pretend there is some decidedly “good” on-call system. It is easily the toughest work/life balance problem we face. But for the sake of argument, let’s also assume it isn’t going away anytime soon.
Because really.
“People, processes, and tools” is a feedback loop in which each part is prone to failure. Just like the systems we are creating for business, on-call is yet another system in our organizations that needs to be engineered with proper appreciation. Failure to do so only contributes to a burdensome, destructive spiral.
Ambivalence and resentment toward on-call are so incredibly dangerous to morale that they will ultimately affect the velocity at which people do their daily work. This can feed each piece of the puzzle in increasingly negative ways, eventually resulting in things like tarnishing the reputations of individuals when failure is encountered, which is the worst thing a bad process can do.
Failing to develop ways to deal with failures in the on-call system will only perpetuate further failure.
It is paramount that teams pay real attention to over-escalation and to understanding why people were called when an incident does occur. This goes so much further than a postmortem. To begin with, incident handling needs to be in a mature enough place that the incidents being raised are truly exceptional, alerting a human only because they could not be handled any other way. And the people working those incidents should be uninhibited in making decisions and in contributing to the improvement of the system.
If mundane things are causing lots of on-call commotion, then building reliability into the business needs to become more of a focus as a matter of principle. During an incident, this can take the form of a “Duty Manager”: someone there not to call out people at fault or coerce status updates, but to examine all aspects of the incident, and the process itself, and come to a critical conclusion about what can be improved in its systems, why certain actions made sense at the time, and how they may nonetheless have turned out to be wrong.
If the unavailability of key individuals, remote working, or on-call participation is causing pain, that is a sign the organization is not effectively implementing DevOps practices.
https://www.flickr.com/photos/ravi-shah/19806997920
Maybe you’re having trouble crossing the chasm on this one from a business-value perspective.
Is getting to five nines, reducing customer pain, and focusing on reliability going to sell one more copy of our product?
How much do your major incidents cost to run?
I’ll take prioritizing the improvement of daily work and continual savings on OpEx every time.
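To put even rough numbers on that question, here is a back-of-envelope sketch. Every figure in it (incident count, responder count, hours, loaded rate, downtime impact) is a placeholder assumption to swap for your own data:

```python
# Back-of-envelope annual cost of running major incidents.
# All inputs are placeholder assumptions; substitute your own numbers.
incidents_per_year = 24          # assumed: two majors a month
responders_per_incident = 6      # engineers, duty manager, comms
hours_per_incident = 3           # average page-to-resolution time
loaded_hourly_rate = 120         # fully loaded cost per responder-hour (USD)
downtime_cost_per_hour = 5_000   # assumed revenue/SLA impact per incident-hour

people_cost = incidents_per_year * responders_per_incident * hours_per_incident * loaded_hourly_rate
downtime_cost = incidents_per_year * hours_per_incident * downtime_cost_per_hour
print(f"people: ${people_cost:,}  downtime: ${downtime_cost:,}  total: ${people_cost + downtime_cost:,}")
```

Even with conservative placeholders, the recurring cost dwarfs the effort of fixing the noisiest failure modes.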
We should be discouraging reliance on on-call to solve uninteresting problems. We should not optimize our typical way of working around on-call, but instead view it as a means of temporarily increasing the capacity of our systems to respond to fundamentally surprising events.
We can try to look at how a bad on-call system overburdens our organizations from a resilience engineering perspective.
Every system is ultimately stretched to operate at its capacity as efficiency improves and human leaders exploit new capabilities to demand more complex forms of work.
This may sound similar to the Jevons paradox or “induced demand.” The good old oversimplified car analogy: add more freeway lanes, and eventually more cars will just fill them up. The point is that adding more people to your team is not always a sufficient fix; there will always be more work to do.
However, there is the idea of the “margin of maneuver”: the ability of a team to employ methods not normally used under typical operations (on-call, in our case) to allow work to continue through increased demand or failure, which helps systems exhibit more resilience. Adapting to changing demands is one of the aspects Agile attempts to robustly address, after all.
The problem is that if you are always operating like this, you are never accurately measuring your true capacity or finding out where the boundaries are. Using a temporary burst of capacity as a perpetual way of working makes it impossible to gain ground and make the improvements needed to escape the downward spiral.
People failing to respond while on-call often triggers a lizard-brain surge of frustration and creates a lot of confusion. But we have to overcome that and instead think in terms of mechanisms like Chaos Monkey or game day exercises in production. Dare to test the brittleness of your on-call system. What really happens to your organization when the critical on-call engineers, whose level of expertise is always changing, fail to respond to a major incident?
In examining high reliability organizations under the law of stretched systems, we find two major themes with regard to incident management: preventing failure, and developing the ability to respond to it.
Discussion of these two properties routinely culminates, at some point, in naive talk of what we often refer to as “situational awareness” (a misleading, underspecified term), a concept that sometimes obscures the actual reasons why it made sense for someone to act the way they did when contributing to a failure. We can’t stop exploring at “shit happens.”
If we only spend effort and time on preventing failure from happening, we are not developing our ability to respond to failure when it does. We have a responsibility to develop those abilities to respond as engineers. — John Allspaw, Resilient Response In Complex Systems
How do we cope with this dilemma holistically? The journey has to begin with encouraging people to accept an escalation. Since people are the greatest source of adaptive capacity in complex systems, we should explore both the conditions under which people fail to participate reliably in on-call and the conditions under which certain people end up “saving the day” repeatedly (without promoting a hero culture).
An ideal world, as much as such a thing exists, has a voluntary, opt-in on-call policy with follow-the-sun staffing, where the primary engineer is never paged outside of their business hours and the rotation is created automatically from calendar preferences. Reality is more often not so refined.
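As a thought experiment, such a rotation can be generated from nothing more than a start date and each engineer’s home region. A minimal sketch, assuming a three-region roster with weekly hand-offs (the names and regions below are invented):

```python
from datetime import date, timedelta

# Minimal, hypothetical sketch of auto-generating a follow-the-sun rotation.
# The roster, regions, and weekly cadence are assumptions for illustration;
# a real version would read calendar preferences and handle swaps/overrides.
ROSTER = {
    "APAC": ["mei", "arjun"],
    "EMEA": ["sofia", "lukas"],
    "AMER": ["dana", "miguel"],
}

def weekly_rotation(start: date, weeks: int):
    """Yield (week_start, region, primary, secondary) so each region covers
    only its own business hours and engineers alternate week to week."""
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        for region, people in ROSTER.items():
            primary = people[week % len(people)]
            secondary = people[(week + 1) % len(people)]
            yield week_start, region, primary, secondary

for entry in weekly_rotation(date(2016, 6, 6), weeks=4):
    print(entry)
```

The point of the sketch is only that the schedule can be derived from data people already maintain, rather than negotiated by hand every month.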
Generally, I don’t put a lot of stock in “behavior” when broaching the subject of resilience, but there are some organizational gotchas at play here. Taking the ABA/OBM (applied behavior analysis / organizational behavior management) approach, the crux of the issue is that we want to increase the behavior of people accepting an escalation when they are paged.
In speaking with a BCBA, I have learned that punishment doesn’t work. Punishment doesn’t teach anything. People cannot learn appropriate and effective alternative behaviors through punishment.
If you want to increase a behavior, you have to give people some sort of payoff: incentivize via immediate reward (not always money, but sure). That is positive and negative reinforcement at work.
Real values determine incentives. Incentives determine behaviors. — Andrew Clay Shafer, There is no Talent Shortage: Organizational Learning is a Competitive Advantage
Some considerations:
https://www.flickr.com/photos/amandacoolidge/7458259492
The vast majority of alerts should not be initiated by a person.
One of the last steps in any alerting pipeline should be a machine-learning or event-processing system that can de-duplicate and consolidate the data into a few sensible alerts.
This requires the alerts and their conditions to carry metadata the system can key off of, which additionally allows it to decorate a ChatOps workflow; a minimal sketch of this idea follows these considerations.
There are many robust on-call escalation products such as VictorOps, PagerDuty, and xMatters that can help organize your rotations without rolling your own orchestrator.
While there is something to be said for the innate sysadmin ability to navigate the unknown, it is nevertheless super frustrating to be on-call for a service you had no part in, meaning no design involvement, training, or communication with the team. This may be indicative of a “Brent” problem (the constrained expert from The Phoenix Project).
The primary and secondary on-call participants should receive a notification when they become active in rotation, and everyone in the rotation should be notified when the schedule is modified.
People should be able to remove themselves from the on-call rotation as needed, without requiring approval. Emergencies and life events happen that will make your on-call engineers unavailable, and getting them out of the rotation is the best option, because robust communication with the secondary on-call engineer is not always possible.
Dear product owners: fear of senior management cannot outweigh your sympathy for individuals. You cannot preach a forgiving, blameless on-call culture where people should be free to go to dinner, the grocery store, or the park, and then later rebuke their remissness when your boss comes down on you. You should have had the systems in place to let that engineer be easily and temporarily removed from the rotation, with a carefully designed, reliable on-call system supporting it.
Failing to respond to an escalation should generally not be reprimanded. If this is really an issue, there is a fundamental systems problem: the system cannot cope with what are normal, uninteresting, and expected failures of people. And folks certainly should not be admonished for things best explained by hindsight bias (e.g., “bad calls,” “careless mistakes,” “momentary incompetence”).
Any consequence to an individual should come naturally in the form of feedback from their peers and mentors. Punishment-free learning reviews should follow. Demonstrate that the organization values restorative practices over retributive justice.
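Picking up the alert-consolidation consideration from above: here is a minimal, hypothetical sketch of that last pipeline step. It is plain event processing rather than machine learning, and every field name (service, check, severity, runbook_url, chat_channel) is an assumption for illustration, not any vendor’s schema.

```python
from collections import defaultdict

# Hypothetical sketch of the consolidation step described in the
# considerations above: plain event processing, not machine learning.
def consolidate(raw_alerts):
    """Fold a pre-windowed batch of raw alerts into one enriched alert per
    (service, check) pair, keeping the metadata a ChatOps bot can use."""
    groups = defaultdict(list)
    for alert in raw_alerts:
        groups[(alert["service"], alert["check"])].append(alert)

    consolidated = []
    for (service, check), alerts in groups.items():
        newest = max(alerts, key=lambda a: a["timestamp"])
        consolidated.append({
            "service": service,
            "check": check,
            "count": len(alerts),                       # raw alerts folded into this one
            "severity": max(a["severity"] for a in alerts),
            "runbook_url": newest.get("runbook_url"),   # lets the bot link the runbook
            "chat_channel": newest.get("chat_channel"), # where the bot should post
            "last_seen": newest["timestamp"],
        })
    return consolidated

if __name__ == "__main__":
    batch = [
        {"service": "api", "check": "latency", "severity": 2, "timestamp": 1000,
         "runbook_url": "https://wiki.example/runbooks/api-latency", "chat_channel": "#ops"},
        {"service": "api", "check": "latency", "severity": 3, "timestamp": 1060,
         "runbook_url": "https://wiki.example/runbooks/api-latency", "chat_channel": "#ops"},
    ]
    print(consolidate(batch))  # one alert: count=2, severity=3
```

The design choice that matters is not the grouping logic but the metadata: if alerts carry a runbook and a chat channel, the thing that wakes a human up can also hand them the context to act.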
#beerpology