Healthy on-call isn’t about fewer incidents. It’s about building a system people can actually sustain.

Most engineering teams can tell when on-call is going badly. The pager is noisy, the same people keep getting pulled into serious incidents, and one bad night quietly spills into the next workday. The symptoms are familiar enough that nobody really needs them explained.

What is less clear, in a lot of organizations, is what good on-call is supposed to look like. Teams say they want fewer interruptions, less burnout, and better balance, but those are still negative definitions. They describe what people want to avoid, not the kind of operational model they want to build.

That gap matters. If “healthy on-call” only means “less painful than before,” teams will tolerate a lot that they should be fixing. A better standard is more concrete. A good rotation is not one where nothing ever breaks. It is one where the load is shared fairly, the person receiving the page knows what they own, documentation is good enough to use under pressure, and people can recover after a rough shift without absorbing all of that cost themselves.

Why Fairness Matters

One of the easiest mistakes teams make is to look at on-call only in aggregate. They count total incidents, total alerts, or how often each person appears on the schedule. Those numbers do matter, but they can hide the part people actually experience: whether the burden is shared evenly.

A rotation can look balanced on paper and still lean too hard on a small group of engineers. It usually shows up in familiar ways. Certain people get pulled into the hardest incidents even when they are not on call. Certain systems effectively belong to one or two individuals. Certain responders become the unofficial second line whenever something ambiguous or high-stakes happens.

That kind of unevenness does real damage over time. People do not experience on-call as an average. They experience it as interruption, stress, context switching, and responsibility. If those things keep landing on the same few people, the rotation is unhealthy even if the total volume looks manageable.

This is why fairness belongs at the center of any discussion about on-call health. The question is not only how much work exists. It is also who is repeatedly being asked to carry it.

Sleep Is Part of the Cost

Teams also tend to underestimate the cost of nighttime interruption. It is common to treat overnight incidents as an unfortunate but routine part of keeping systems running. In practice, the effects last much longer than the incident itself.

Someone who has been paged in the middle of the night is not starting the next day from the same place they would have otherwise. Focus is worse. Patience is worse. Deep work takes longer. The person may still be online, still in meetings, and still technically available, but that should not be confused with being fully recovered.

A lot of teams count the incident but not its impact. The page is visible. The lost sleep is acknowledged for a few minutes. Then the planning assumptions for the next day stay the same. The same deadlines remain. The same meetings stay on the calendar. The same expectations hold, as though responding overnight had no operational cost beyond the time spent during the incident itself.

That is not a healthy way to run on-call. If the system required someone to give up sleep to protect production, the team should treat the consequences of that interruption as part of the work. Otherwise the cost does not disappear. It just gets pushed onto the responder privately.
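If a team wants to see this skew rather than argue about it, the raw paging data is usually enough. The sketch below is illustrative only: it assumes a hypothetical list of page records with responder and timestamp fields (in practice this would come from your paging tool’s export or API), and the 3x weight on overnight pages is an arbitrary placeholder for “this page also cost part of the next day,” not a calibrated number.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page records; real ones would come from your paging
# tool's export or API.
pages = [
    {"responder": "aisha", "timestamp": "2024-03-02T03:14:00"},
    {"responder": "aisha", "timestamp": "2024-03-05T14:02:00"},
    {"responder": "ben",   "timestamp": "2024-03-06T11:30:00"},
]

# Arbitrary assumption: an overnight page "costs" three times a daytime
# one, because the responder also loses part of the next working day.
OVERNIGHT_WEIGHT = 3.0
NIGHT_HOURS = range(22, 24), range(0, 7)  # 22:00-07:00, local time assumed

def is_overnight(ts: str) -> bool:
    hour = datetime.fromisoformat(ts).hour
    return any(hour in block for block in NIGHT_HOURS)

load: Counter = Counter()
for page in pages:
    weight = OVERNIGHT_WEIGHT if is_overnight(page["timestamp"]) else 1.0
    load[page["responder"]] += weight

# A skewed distribution here is the signal: the totals can look fine
# while a few names carry most of the weighted load.
for responder, score in load.most_common():
    print(f"{responder}: {score:.1f}")
```

None of the individual numbers matter much; the shape of the per-person distribution is the point, and it is exactly what aggregate counts hide.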
Heroics Are a Warning Sign

Most engineering organizations have a few people who are especially strong in incidents. They are calm, fast, and deeply familiar with the systems around them. In the moment, it is easy to see those people as proof that the team is good under pressure.

Sometimes that is true. More often, repeated heroics are pointing to something less flattering. When the same engineers are always the ones who hold the context everyone else is missing, the ones who know where to look first, or the ones everybody expects to step in when things get messy, the team is relying on individual knowledge in places where it should be relying on clearer systems.

The problem is not that strong incident responders exist. The problem is that too much of the organization’s operational resilience may live inside a few people instead of inside shared processes, documentation, and ownership boundaries.

Over time, teams normalize this without meaning to. Certain engineers become the real safety net for the rest of the system. They may not be the only people on the rotation, but they are the people everyone assumes will get involved if the incident is serious enough.

That is not a durable model, and it is not a good definition of health. Good on-call should not require regular acts of rescue. It should require clear systems, good judgment, and enough shared context that the job does not collapse onto the same people every time something goes wrong.

Clear Ownership Changes Everything

A large part of what makes bad on-call stressful is not only the work itself, but the uncertainty around it. An alert fires and the responder has to figure out whether the service is theirs, whether the signal is trustworthy, what a safe first step looks like, and when someone else should be brought in. That kind of ambiguity is expensive, especially at the start of an incident.

Clear ownership changes the experience of being on call more than most teams expect. When responders know which systems they are responsible for, what the escalation path is, and what a reasonable first response should look like, incidents become easier to navigate. They may still be difficult, but they are less chaotic.

There is also a broader team benefit. Clear ownership reduces unnecessary escalation, limits off-rotation interruption, and makes it easier for engineers to build confidence over time. It turns incident response into something that can be learned and supported, rather than something that depends on improvising through confusion.

A lot of on-call pain comes from uncertainty that should have been resolved long before the page ever fired. Ownership does not solve every problem, but it removes a surprising amount of avoidable strain.
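One way to take the guessing out of that first minute is to make ownership machine-readable, so a page arrives already carrying an owner, an escalation path, and a starting point. The sketch below is a minimal illustration under assumed names, not a prescribed schema: the services, teams, and wiki URLs are all hypothetical, and most paging tools can express the same mapping in their own configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ownership:
    team: str            # who is paged first
    escalates_to: str    # who gets pulled in if the first page stalls
    runbook_url: str     # where the first safe steps are documented

# Hypothetical ownership map; every alert should resolve to exactly
# one entry so the responder never has to guess whose page this is.
OWNERS = {
    "checkout-api":   Ownership("payments", "payments-leads",
                                "https://wiki.example/runbooks/checkout-api"),
    "search-indexer": Ownership("search", "platform",
                                "https://wiki.example/runbooks/search-indexer"),
}

def route_alert(service: str) -> Ownership:
    # An unmapped service is itself a finding: fix the map, don't guess.
    if service not in OWNERS:
        raise LookupError(f"no owner recorded for {service!r}")
    return OWNERS[service]
```

The useful property is not the code but the constraint it enforces: every alert resolves to exactly one owner, and an unmapped service fails loudly instead of landing on whoever seems most likely to know.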
Runbooks Make On-Call Easier

Runbooks are often discussed as a speed tool, which they are. But they also play a much more human role than that. Good documentation lowers the cognitive load of responding to incidents. It gives people a place to start when they are tired, under pressure, or less familiar with a service than the person who originally built it.

That matters because on-call gets worse as soon as too much of the process depends on memory. If the only way to respond safely is to remember exactly how a fragile system behaves, or to know which person to message, then the rotation is not really designed to support the team. It is designed around the assumption that someone with enough context will always be available.

Usable runbooks make the job less brittle. They help more engineers respond effectively. They reduce hesitation. They reduce the need to pull in the same expert every time something breaks. In that sense, runbooks are not just documentation. They are part of what makes the rotation workable.

Teams sometimes treat runbooks as a nice-to-have that can wait until later. In practice, they are one of the clearest ways to make on-call less stressful for the people actually doing it.

Recovery Should Be Expected

The final test is what happens after a bad night. If someone handles an incident at 2 a.m., does the team actually adjust the next day? Or does everyone behave as though the incident is over and normal expectations immediately resume?

That answer reveals a lot about how a team thinks about on-call. Unhealthy teams tend to treat recovery as an individual concern. Healthy teams treat it as part of operational planning. That does not mean every alert requires a dramatic policy response, but it does mean the organization recognizes that incident response takes something out of people and plans around that fact instead of ignoring it.

Sometimes recovery looks like moving a meeting. Sometimes it means shifting a deadline, redistributing work, or simply acknowledging that the person who handled the incident is not starting the day at full capacity. The specific response matters less than the principle. If incident response has a human cost, the system should account for it.

When recovery is left to personal sacrifice, on-call starts to feel like work that counts twice. People do the incident work when it happens, then quietly pay for it again afterward by making up lost time on their own. That is one of the clearest signs that a rotation may be functioning, but not functioning well.

A healthy rotation is not defined by silence. It is defined by whether the system is asking people to do work they can realistically sustain over time. If the load is fair, ownership is clear, documentation is useful, heroics are rare, and recovery is expected, on-call starts to look less like a recurring tax on a few individuals and more like a mature way of supporting production. That is the standard more teams should be aiming for.