The Muri of a Bad On-Call System

by this hits home, March 21st, 2017

This is not to pretend there is some decidedly “good” on-call system. It is easily the toughest work/life balance problem we face. But for the sake of argument, let us pretend it isn’t going away anytime soon.

Because really.

“People, processes, and tools” is a feedback loop in which each part is prone to failure. Just like the systems we are creating for business, on-call is yet another system in our organizations that needs to be engineered with proper appreciation. Failure to do so only contributes to a burdensome, destructive spiral.

Ambivalence and resentment toward on-call are so incredibly dangerous to morale that they will ultimately affect the velocity at which people do their daily work. This can feed each piece of the puzzle in increasingly negative ways, eventually resulting in things like tarnishing the reputation of individuals when encountering failure — the worst thing a bad process can do.

Failing to develop ways to deal with failures in the on-call system will only perpetuate further failure.

It is paramount that teams pay significant attention to over-escalation and to understanding why people were called when an incident does occur. This goes so much further than a postmortem. To begin with, incident handling needs to be in a mature enough place that the incidents being raised are truly exceptional, alerting a human only because they could not be handled otherwise. And the people working those incidents should be uninhibited in making decisions and contributing to the improvement of the system.

If mundane things are causing lots of on-call commotion, then building reliability into the business needs to be more of a focus on principle. During an incident, this can come in the role of a “Duty Manager” — not to call out people at fault or coerce status updates, but to examine all aspects of the incident and the very process itself to come to a critical conclusion about what can be improved with regard to its systems, why certain actions made sense at the time, and how they may have been conclusively wrong.

If unavailability of key individuals, remote working, or on-call participation are causing pain, it is indicative that an organization is not effectively implementing DevOps practices.

Maybe you’re having trouble crossing the chasm from a business value perspective on this.

Is getting to five nines, reducing customer pain, and focusing on reliability going to sell one more copy of our product?

How much do your major incidents cost to run?

I’ll take prioritizing around improvement of daily work and a continual savings on OpEx every time.

We should be attempting to discourage reliance on on-call to solve uninteresting problems. We should not optimize our typical way of working around on-call, but instead view it as a means of temporarily increasing capacity in our systems to respond to fundamentally surprising events.

We can try to look at how a bad on-call system overburdens our organizations from a resilience engineering perspective.

The Law of Stretched Systems

Every system is ultimately stretched to operate at its capacity as efficiency improves and human leaders exploit new capabilities to demand more complex forms of work.

This may sound similar to Jevons paradox or ‘induced demand’. The good old oversimplified car analogy is adding freeway lanes — eventually more cars will just fill up the freeway. The point is that adding more people to your team is not always a sufficient fix; there will always be more work to do.

However, there is the idea of the “margin of maneuver”: the ability of a team to employ methods not normally used under typical operations (on-call, in our case) to allow work to continue through increased demand or failure, which helps systems exhibit more resilience. Adapting to changing demands is one of the aspects Agile attempts to robustly address, after all.

The problem comes in that if you are always operating like this, you are never accurately measuring your true capacity or finding out where the boundaries are. Using a temporary burst of capacity as a perpetual way of working makes it impossible to gain ground and further the things needed to escape the downward spiral.

When looking at high reliability organizations through the lens of the law of stretched systems, we can define two key characteristics:

  • Brittleness — A manifestation of a stretched system. A system with little or no margin to absorb further demand is said to be ‘brittle’.
  • Resiliency — The properties of a system that allow it to absorb unusual amounts of stress without causing a failure in the integral function of the organization. [Clarification: Resilience is not something systems have, it is something systems do. Resilience is a verb.]

People failing to respond while on-call often triggers the lizard-brain element of cardinal frustration and creates a lot of confusion. But we have to overcome that and think instead in terms of mechanisms like Chaos Monkey or game day exercises in production. Dare to test the brittleness of your on-call system. What really happens to your organization when the critical on-call engineers — whose level of expertise is always changing — fail to respond to a major incident?

In examining high reliability organizations, we find two major themes with regard to incident management:

  • Anticipation—focused on preparedness, this is a state of mindfulness throughout the entire org in which continuous vigilance for potential sources of harm is expected and practiced as a shared value.
  • Containment — actions to be taken immediately when a system fails, to avert or mitigate further damage and injury.

These two properties are about preventing failure and developing the ability to respond. Discussion of them routinely culminates in naive appeals to what we often call “situational awareness” (a misleading, underspecified term) — a concept that can obscure the actual reasons why it made sense for someone to act the way they did when contributing to a failure. We can’t stop exploring at “shit happens.”

If we only spend effort and time on preventing failure from happening, we are not developing our ability to respond to failure when it does. We have a responsibility to develop those abilities to respond as engineers. — John Allspaw, Resilient Response In Complex Systems

How do we cope with this dilemma holistically? The journey has to begin with encouraging people to accept an escalation. Since people are the greatest source of adaptive capacity in complex systems, we should explore the conditions both in which people are failing to participate reliably in on-call and of those who are ‘saving the day’ repeatedly (and without promoting the hero culture).

In an ideal world, as much as such a thing exists, on-call is a voluntary, opt-in policy with follow-the-sun staffing, where the primary engineer is never paged outside their business hours and the rotation is generated automatically from calendar preferences. Reality is more often not so refined.
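To make the follow-the-sun idea concrete, here is a minimal sketch of picking a primary whose local business hours cover a given UTC hour. The `Engineer` structure, time zone offsets, and team names are all invented for illustration; a real scheduler would use proper time zone data and calendar preferences.

```python
from dataclasses import dataclass

@dataclass
class Engineer:
    name: str
    utc_offset: int        # hours from UTC of their home time zone
    business_hours: range  # local hours during which they can be primary

def primary_for_hour(engineers, utc_hour):
    """Pick a primary whose local business hours cover this UTC hour."""
    for eng in engineers:
        local_hour = (utc_hour + eng.utc_offset) % 24
        if local_hour in eng.business_hours:
            return eng
    return None  # a coverage gap the rotation design needs to surface

# hypothetical three-region team
team = [
    Engineer("amelia", utc_offset=0, business_hours=range(9, 17)),   # London
    Engineer("hiro", utc_offset=9, business_hours=range(9, 17)),     # Tokyo
    Engineer("dana", utc_offset=-8, business_hours=range(9, 17)),    # Seattle
]

schedule = {h: primary_for_hour(team, h) for h in range(24)}
```

Even this toy team leaves one UTC hour uncovered, which the `None` return makes visible — exactly the kind of boundary a real rotation tool should report rather than paper over.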

Applied Behavior Analysis

Generally, I don’t put a lot of stake in ‘behavior’ when broaching the subject of resilience, but there are some organizational gotchas at play here. Taking the ABA/OBM (applied behavior analysis / organizational behavior management) approach, the crux of the issue is that we want to increase the behavior of people accepting an escalation when they are paged.

In speaking with a BCBA (Board Certified Behavior Analyst), I have learned that punishment doesn’t work. Punishment doesn’t teach anything. People cannot learn appropriate and effective alternative behaviors through punishment.

If you want to increase a behavior, you have to give people some sort of payoff — incentivizing via immediate reward (and not always money, but sure). Positive and negative reinforcement.

Real values determine incentives. Incentives determine behaviors. — Andrew Clay Shafer, There is no Talent Shortage: Organizational Learning is a Competitive Advantage

Some considerations:

  • Start with paying people appropriately, properly, and legally for on-call. But know that overtime really is not an incentive.
  • Prepare and train people for how to participate in on-call. Host incident response role-plays among critical teams, even for the experts, to develop a strong understanding of preparedness, engagement, and introspection. Everyone in the org, everyone, should comprehend the expectations, roles, and etiquette around incident management and how it relates to the business. Champion these ideas during a regular internal conference event focused on reliability, availability, and learning efforts.
  • Provide a laptop, phone, and tethering/hotspot for every on-call member so that people will be able to remote into work from anywhere they happen to be. Consider creating an on-call care package to be issued to new teammates.
  • Pay a bonus/stipend for successfully completing an entire on-call shift in which there was at least one escalation. If a person fails to accept an escalation during their shift (barring various acceptable edge cases), the bonus is forfeited entirely and divvied up among the next people in the list who successfully respond and pick up the slack.
  • If someone’s on-call shift was particularly rough, give a comp day. Not just expecting someone in late, working from home, or unofficially time off. Actually give an honest-to-god paid, full day off. And don’t require a person to ask for it. Be timely, appreciative, and proactive about awarding them. If you’re a cool cat and don’t have a PTO policy, be direct and forthcoming in telling folks to take time off.
  • Have technology that encourages people to hand off their on-call shift after a certain number of hours in aggregate. Not just posting a long-running per-incident notice, but a real, maturely respected system for doing so. We must discourage praising the heroics of persistence through on-call.
  • I hesitate to suggest this one because it can be dangerous and mistakenly hype up the burnout and heroics we want to avoid if not implemented properly, but consider gamification of your on-call system. Milestones and recognitions met with reward. Don’t forget to include the limitations.
  • Give people an on-call vacation. Allow 3 months minimum every year out of the rotation.
  • Allow exemptions from mandatory on-call lists for certain classifications and conditions like mental health and accessibility.
  • This depends on the size and make-up of the team, but I suggest allowing parents with children under 3 to voluntarily exempt themselves. Parents with young kids are the most unreliable on-call participants, and a burden on the team will exist regardless of whether or not they are on rotation — either the burden of having fewer people on-call or the burden and additional confusion of a parent intermittently rejecting escalations due to childcare. I don’t think there’s anything worthwhile that is going to make, for example, a parent stop feeding their infant to click the accept button for an escalation.
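The bonus-forfeiture rule above has simple mechanics, sketched here under the assumption that a forfeited bonus is split evenly among the responders who picked up the slack. The function name and amounts are made up; an actual policy would live in payroll, not code.

```python
def settle_shift_bonus(bonus, scheduled, accepted, rescuers):
    """Pay the full shift bonus to the scheduled engineer if they accepted
    their escalations; otherwise split it evenly among those who did."""
    if accepted:
        return {scheduled: bonus}
    if not rescuers:
        return {}
    share = bonus / len(rescuers)
    return {name: share for name in rescuers}

# dana misses her escalation; amelia and hiro respond and split the bonus
payouts = settle_shift_bonus(300, "dana", False, ["amelia", "hiro"])
```

The even split keeps the incentive symmetrical: the reward for covering a missed page comes from the same pool as the reward for answering your own.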

Other Considerations

The vast majority of alerts should not be initiated by a person.

One of the last steps in any alerting pipeline should be a machine-learning or event-processing system that can de-duplicate and consolidate the data into a few sensible alerts.

This requires the alerts & conditions to contain metadata the system can key off of that will additionally allow it to decorate a chatops workflow.
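One way to picture that consolidation step: alerts carrying structured metadata can be grouped by a dedup key before anything pages a human. This is a minimal sketch, not any particular product's behavior, and the field names (`service`, `component`, `symptom`, `host`) are illustrative.

```python
from collections import defaultdict

def consolidate(alerts, window_s=300):
    """Collapse raw alerts into one consolidated alert per
    (service, component, symptom) key within a time window."""
    buckets = defaultdict(list)
    for a in alerts:
        key = (a["service"], a["component"], a["symptom"],
               a["timestamp"] // window_s)  # coarse time bucket
        buckets[key].append(a)
    out = []
    for (service, component, symptom, _), group in buckets.items():
        out.append({
            "service": service,
            "component": component,
            "symptom": symptom,
            "count": len(group),
            "first_seen": min(a["timestamp"] for a in group),
            # extra metadata a chatops bot can use to decorate the channel
            "hosts": sorted({a["host"] for a in group}),
        })
    return out
```

Two latency alerts from different database hosts in the same window become a single page with a host list attached, while an unrelated cache alert stays separate.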

There are many robust on-call escalation products such as VictorOps, PagerDuty, and xMatters that can help organize your rotations without rolling your own orchestrator.

While there is something to be said for the innate sysadmin ability to navigate the unknown, it is nevertheless super frustrating being on-call for a service you had no part in — meaning no design involvement, training, communication with the team, etc. This may be indicative of a Brent problem.

  • Alerting systems should have some sort of ‘component’ mechanism that teams can subscribe to if they are to support it. This way, escalations can initially be automatically routed most of the time to all necessary parties.
  • Your build/deploy/delivery system is a great place to source and join data that describes service configuration, alerts, and the teams who support them with your on-call escalation system. An ad-hoc call button for any service in the UI is very handy.
  • Both developers and operations people on service teams should be alerted. Some make the argument that only developers should be on-call. I disagree with this, but I do understand it in one specific context where the operations people are serving only to further escalate an incident due to not being empowered to actually make substantial contributions to improve the reliability of the system (what’s the point of them being on call then?). I would argue that this is a cultural problem. Resentment grows if all disciplines of the org are not participating in on-call. This means including folks from CS, PR, and Legal during major incidents.
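The ‘component’ subscription idea from the first bullet can be pictured as a registry mapping components to subscribed teams, so an escalation fans out to all necessary parties automatically. A sketch with invented team and component names:

```python
class ComponentRouter:
    """Route escalations to every team subscribed to an affected component."""

    def __init__(self):
        self.subscriptions = {}  # component -> set of team names

    def subscribe(self, team, component):
        self.subscriptions.setdefault(component, set()).add(team)

    def route(self, components):
        """Return all teams to page for an incident touching these components."""
        teams = set()
        for c in components:
            teams |= self.subscriptions.get(c, set())
        return teams

router = ComponentRouter()
router.subscribe("payments-dev", "billing-db")
router.subscribe("dba-ops", "billing-db")
router.subscribe("payments-dev", "invoice-api")

# an incident touching the billing database pages both subscribed teams
to_page = router.route(["billing-db"])
```

Sourcing the component-to-team mapping from your build/delivery system, as suggested above, keeps this registry from drifting out of date by hand.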

The primary and secondary on-call participants should receive a notification when they become active in rotation, and everyone in the rotation should be notified when the schedule is modified.

People should be able to remove themselves from on-call rotation as needed without needing approval. Emergencies and events happen which will make your on-call engineers unavailable, and getting them out of the rotation is the best option because robust communication with the secondary on-call engineer is not always possible.

Dear product owners — fear of senior management cannot outweigh your sympathy for individuals. You cannot preach a forgiving, blameless on-call culture where people are free to go to dinner or the grocery store or the park, and then rebuke their supposed remissness when your boss comes down on you. You should have had the systems in place to allow that engineer to be easily removed from the rotation temporarily, with an intricate, reliable on-call design supporting it.

Failing to respond to an escalation should generally not be reprimanded. If this is really an issue, there is a fundamental systems problem: the system cannot cope with the normal, uninteresting, and expected failures of people. And folks certainly should not be admonished for things explained by the pitfalls of hindsight bias (e.g. bad calls, careless mistakes, momentary incompetence).

Any sort of consequence to an individual should just naturally come in the way of feedback from their peers and mentors. Punishment-less learning reviews should follow. Demonstrate that the organization values restorative practices over retributive justice.