The On-Call Process: 7 Reasons It’s Not Up to Par and Ways to Improve

by Tatsiana Mihai, July 6th, 2023

Too Long; Didn't Read

Tatsiana Mihai shares her tips on how to improve the on-call experience. She explains how to avoid problems that can lead to burnout and to engineers leaving the project, and gives her top tips on dealing with non-engineering tickets in the on-call system.



Once a product moves from the internal testing stage to acquiring real users, it begins to receive complaints about things that don't work. The more critical the problem, the less time users are willing to wait. That is why almost all companies are warming up to the idea of having an on-call process.


Unfortunately, a poorly organized on-call process can cause product metrics to drop, slow down development, and ruin the engineers' work-life balance, which in the worst-case scenario leads to burnout and the decision to leave the project.


If you notice that on-call is seen as punishment, it’s time to act.


Most on-call problems are caused by technical debt or process issues, and one often becomes the reason for the other. It's good to remember a simple rule: every issue that appears during on-call must be followed by an action preventing its recurrence. It can be a code change, a process improvement, or a documentation amendment. Now, let's dive into the details.


Content Overview

  • Reasons why on-call is not getting any better and ways to improve

    1. You don’t know your services well enough
    2. You get too many non-engineering tickets
    3. You don’t share knowledge within the team
    4. You don’t prioritize incoming issues
    5. You endorse hotfixes
    6. You allow using personal accounts
    7. You don’t eliminate flakiness

  • Why it’s crucial to analyze the on-call process

  • Conclusion



Reasons why on-call is not getting any better and ways to improve

1. You don’t know your services well enough

Poor service knowledge is common in teams whose composition frequently changes without proper knowledge transfer. Sometimes a reorganization of the company structure is followed by a service transfer to new owners, which makes the first on-call shifts the most difficult. Also, if the product consists of many services and a typical two-pizza team can't physically manage all of them, it's unlikely that one engineer has deep knowledge of all the dependencies.


In addition, Customer Support is rarely aware of the internal architecture and creates a ticket for the first service users interact with, leaving further triaging to engineers. If the team is not confident in service stability or is unaware of all the dependencies, an on-call engineer is likely to spend time looking into services that were operating fine all along.


Practice case

An on-call engineer received a task to triage a payout discrepancy, as the difference between the numbers users could see in their web accounts and their bank statements was quite big. It took a day to validate the UI, the API, and the data pipelines, in parallel with sending emails to get customer confirmation to access the profile. When it became apparent that the services were healthy (supported by monitoring data), it was decided to re-assign the issue to the team responsible for collecting the source data. They resolved the task within 5 minutes, as the case was well known.

How to avoid

An excellent way to minimize the risk is to ensure everybody knows where to start when an incident occurs. You can create a monitoring dashboard with key metrics and an on-call runbook with the most frequent cases, and include both in the onboarding plan. Once an incident happens, every engineer should be able to evaluate the system's state based on the information from these two sources. If a case appears for the first time, an engineer might extend the monitoring or add a new case to the runbook as a follow-up action.
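
To make the idea concrete, here is a minimal sketch of what a structured runbook entry could look like. All names, URLs, and fields are hypothetical; a plain wiki page with the same fields works just as well.

```python
# A sketch of a machine-readable runbook entry, assuming the runbook lives in
# the repository next to the service. All names and URLs are hypothetical.

from dataclasses import dataclass, field

@dataclass
class RunbookEntry:
    alert: str                       # alert or symptom as the on-call engineer sees it
    dashboard: str                   # where to check the key metrics first
    likely_causes: list[str] = field(default_factory=list)
    first_actions: list[str] = field(default_factory=list)
    escalate_to: str = "team-lead"   # who to pull in if the first actions fail

RUNBOOK = {
    "payout-discrepancy": RunbookEntry(
        alert="Payout numbers differ between web account and bank statement",
        dashboard="https://grafana.example.com/d/payouts",
        likely_causes=["stale source data", "delayed pipeline run"],
        first_actions=["check last pipeline run time", "compare source totals"],
        escalate_to="source-data-team",
    ),
}
```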


If you know what services live behind yours and could be the source of the issue, it's OK to involve the responsible team in triaging to speed up the process. However, remember that your team owns the on-call issue until you provide a valid reason why the case must be re-assigned, so continue investigating even if somebody has agreed to help.


Once you see you're getting similar issues by mistake (we'll talk later about tracking this), it's time to change the task assignment process. If you use an automatic system, consider adding rules that distinguish your services from others. Also, make sure Customer Support's documentation contains steps they can use to triage the issue in more detail and assign it to the right team.
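
As an illustration, here is a minimal sketch of keyword-based routing rules for incoming tickets. The team names, keywords, and ticket shape are hypothetical; most ticketing systems let you express similar rules natively.

```python
# A minimal sketch of keyword-based routing for incoming tickets.
# Team names, keywords, and the ticket shape are hypothetical.

ROUTING_RULES = [
    # (keywords found in the ticket, team that should triage it)
    ({"payout", "bank statement", "discrepancy"}, "payments-source-data"),
    ({"token", "expired", "onboarding"},          "integrations"),
    ({"dashboard", "chart", "rendering"},         "web-frontend"),
]

DEFAULT_TEAM = "our-team"  # fall back to the current owner

def route_ticket(title: str, description: str) -> str:
    """Return the team a new ticket should be assigned to."""
    text = f"{title} {description}".lower()
    for keywords, team in ROUTING_RULES:
        if any(keyword in text for keyword in keywords):
            return team
    return DEFAULT_TEAM

print(route_ticket("Payout discrepancy", "Numbers differ from the bank statement"))
# -> payments-source-data
```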


2. You get too many non-engineering tickets

During on-call shifts, I often received false requests that could have been resolved without involving an engineer. All I could do was leave a comment like "this is by design" and reference the documentation.


It's pretty easy to recognize this type of task: you don't take any action that affects the service, like creating a PR or re-running a job. Most likely, the reason is poor UX, overwhelming public documentation, or both. If customers aren't sure how to navigate sophisticated features, they can produce a curious but invalid configuration that won't work as expected. Or, when the next step isn't clear, it's easier for customers to ask for support than to deal with the issue themselves.


Sometimes the reason is even an improvement that changed the service without notifying customers. A new flow might be confusing or even break existing integrations, which will also raise the number of on-call tickets.


To deal with this, an engineer needs deep domain and service knowledge, which usually comes only after working at the company for a while. Otherwise, be ready to spend some time before you figure out that "blocking an account after three wrong password attempts" is done on purpose.

Practice case

The service was designed to integrate with third parties by generating a token with an expiration period. Once the period passed, the user had to repeat the process. Because of poor product design, the token generation step was included in the onboarding flow, and there was no reason to return to that page once the setup was completed, as it contained no other helpful information. As you can probably guess, every time a token expired, the on-call engineer got a new issue: "Client N hasn't received data for more than M hours, the API responds with an error," and after triaging, had to redirect the customer to the onboarding page to regenerate the token.

How to avoid

You can often see a relationship between such tasks and insufficient information at the time of the error. Luckily, the solution is pretty elegant: change the service so that the user can understand what is wrong. For example, the message "Something went wrong" is a much less useful error description than "Changes to the price can't be applied because your campaign is active. If you want to change the price without affecting other settings, you can finish this campaign and run a new one by cloning its settings." Work on such tasks together with UX specialists to find the optimal solution.


Do not expect that adding all the details to the public documentation will dramatically change the number of incoming tasks: in the case of large systems with complex configurations, users prefer to contact support directly because it's faster and more efficient. Review the documentation Customer Support or CSMs use for triaging and make sure the information there is up to date, as it sometimes contains detailed instructions with snapshots and links that might be outdated.


3. You don’t share knowledge within the team

This large category is typical for companies that support expertise in specific technologies and encourage growing individual product experts within the team. Add outdated documentation to this, and you'll get a very painful on-call with many constraints, which for some engineers turns into the lifelong role of a support engineer.


In addition, it increases the risk of slowing down product development, not to mention limiting domain knowledge to a very narrow circle of specialists. Onboarding new employees to such products is slower and of lower quality, as information is transmitted selectively, and some services can be skipped due to tech stack mismatches. Often it all happens in a live meeting without a recording and with a freshly drawn whiteboard of dependencies that is lost right after the call. I once worked on a product where, after an expert decided to quit, the team couldn't upgrade the service safely because nobody knew how a couple of features functioned.


When it comes to the on-call process, the knowledge needed is more specific: triaging and taking quick actions in an emergency. If the more experienced engineers don't share it, the other engineers will always ask for help, distracting teammates from regular product work.

Practice case

After another reorganization, teams that used to be divided by technology stack were united into product teams supporting UI, API, and data management. Since most of the issues were related to data inconsistency and unexpected shutdowns of instances, front-end engineers were not initially involved in the on-call. At some point, rotations among back-end engineers became too frequent and slowed product development, so it was decided to involve everybody.


The first shifts revealed that front-end engineers could not cope on their own: there was no easy way to figure out why an alert had triggered, which instance was affected, and how to recover it. Because the front end and back end used different tools, the engineers had neither links nor access to each other's monitoring systems. All knowledge was passed on verbally (hey, tribal knowledge) or acquired while touching the service codebase.

How to avoid

There are several effective methods here. First, ensure that domain and service data are stored in a place accessible to everyone. Follow the same template for all services and automate where possible. For example, if you keep public URLs for different environments on GitHub, create a README template containing this section and add linting rules to remind engineers to fill it in. For sensitive information like credentials, specify where to find them for each environment. It's acceptable to use multiple documentation services simultaneously, as sometimes one tool is helpful for sales while another is for engineers; just make sure everything is up to date and consistent, and that everyone on the team knows how to find it. To make on-call easier, you can create a runbook. As a rule, it is a separate document that covers the most useful information for analyzing tasks, especially for beginners and under the pressure of urgency and severity.
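
As a sketch of the automation part, the check below fails a CI run when a service README is missing the agreed sections. The section names and repository layout are assumptions for illustration.

```python
# A minimal sketch of a CI check that fails when a service README is missing
# the agreed sections. Section names and the repo layout are hypothetical.

import pathlib
import sys

REQUIRED_SECTIONS = [
    "## Environments",   # public URLs per environment
    "## Credentials",    # where to find them, never the secrets themselves
    "## Monitoring",
    "## Runbook",
]

def missing_sections(readme: pathlib.Path) -> list[str]:
    text = readme.read_text(encoding="utf-8")
    return [section for section in REQUIRED_SECTIONS if section not in text]

if __name__ == "__main__":
    problems = 0
    for readme in pathlib.Path("services").glob("*/README.md"):
        for section in missing_sections(readme):
            print(f"{readme}: missing section '{section}'")
            problems += 1
    sys.exit(1 if problems else 0)   # non-zero exit fails the pipeline
```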


It's vital not just to have documentation but to make sure each engineer knows it exists. Include links in the onboarding plan, appoint a more experienced teammate as a buddy to help with questions that aren't covered, and conduct workshops. Hint: try to have those who recently completed onboarding become buddies for the next newcomers - a deeper understanding of the service is guaranteed.


If someone decides to leave the team, make sure they prepare a handover plan with information about their active projects. Notify the other engineers in advance and allow them to prepare questions, which helps cover edge cases. Record the meeting: this knowledge will stay relevant for some time, and a recording also solves the attendance problem.

A practice that has shown positive results for on-call is the shadow on-call. Commonly, it requires two engineers per shift: the primary on-call is responsible for incoming tickets, and the shadow is ready to cover the primary in an emergency, like a one-day absence, or to support them during a severe incident. Newcomers are put in as shadows for their initial shifts to understand the process. You can arrange paired sessions with the leading on-call engineer for a task from the moment it arrives until it's solved. That helps nail tiny details such as which commands to run or which internal tools to use to look for resources. In some cases, shadows can take the lowest-priority tasks and try to solve them independently.


The last thing I'd like to mention is the end of an on-call shift. In many companies, it's pretty straightforward: one person's shift ends, and the next person's begins. However, you risk losing context for unfinished tasks. Try to introduce the practice of a small team meeting where the on-call engineer sums up their shift: which tasks were completed and what caused them, where they stopped on the tasks in progress, and which tasks are blocked. Thus, all team members will be equally aware and better oriented if they get a similar task during their shift.


4. You don’t prioritize incoming issues

For large systems, it's common to receive several on-call tasks at the same time. It would seem that prioritization doesn't make a difference because, in the end, they all have to be resolved. But if you look at on-call more deeply and evaluate how many people were affected and how much profit was lost, the importance of prioritization becomes clearer. Moreover, a late reaction to more critical issues often causes cascading crashes, leading to more affected services and more drastic measures to restore them. This is quite typical for services with multiple servers and load balancing: if one instance shuts down, there is a high probability that the others will follow (if autoscaling isn't provided). For data-processing systems, a late reaction means longer and more infrastructure-consuming re-processing.

Practice case

Working on a first-come, first-served basis, an on-call engineer selected a task created by another engineer about problems with a test account. A little lower in the same list, there were several similar tasks from the notification system about a daily pause in data processing. By the time customer emails began to arrive in bulk, the processing hadn't been working for several days. The issue was tiny: a new data type wasn't handled properly, and instead of ignoring the unknown input, all calculations for further rows were paused. The error was fixed, and the processing was restarted. Still, for some time, the system used all available computing resources, not to mention that the company had to offer some compensation to customers.

How to avoid

The solution can be divided into two parts: choose a prioritization system and implement it when creating tasks. For the first part, it is necessary to determine which metrics are most significant for the product. For example, if you're developing mobile games, your metric might be the number of users (existing or anticipated) affected by an issue. For B2B, you can evaluate the criticality of the problem based on the daily loss of profit or the expected downtime compensation. Collaborating with PMs, engineers, and Customer Support is the best way to develop the prioritization scale. It's also good to introduce a deadline for each task, as it helps order tasks for triaging properly.
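
Here is an illustrative sketch of what such a prioritization rule could look like in code. The tiers, thresholds, and deadlines are assumptions to be agreed on with PMs and Customer Support, not a standard.

```python
# An illustrative prioritization rule. Tiers, thresholds, and deadlines are
# assumptions for the sake of the example.

from datetime import datetime, timedelta

def priority(affected_users: int, daily_revenue_loss: float) -> str:
    """Map business impact to a priority tier."""
    if affected_users > 10_000 or daily_revenue_loss > 50_000:
        return "P0"   # drop everything
    if affected_users > 500 or daily_revenue_loss > 5_000:
        return "P1"   # handle within the same shift
    if affected_users > 10 or daily_revenue_loss > 0:
        return "P2"   # handle within the same rotation
    return "P3"       # backlog candidate

DEADLINES = {"P0": timedelta(hours=2), "P1": timedelta(hours=8),
             "P2": timedelta(days=2), "P3": timedelta(days=7)}

def triage_deadline(tier: str) -> datetime:
    """Attach a deadline so the on-call queue can be ordered."""
    return datetime.utcnow() + DEADLINES[tier]

print(priority(affected_users=2_000, daily_revenue_loss=0))   # -> P1
```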


5. You endorse hotfixes

Hotfixes are those little elves that rescue situations where the amount of potential damage is significant and there is no time for an elegant solution. Who hasn't added sleep(N) because otherwise "the previous method hasn't been completed yet"? But a hotfix is any hard-coded solution or inefficient manual action an engineer takes. Manually updating configuration files on the server or manually reinitiating instances isn't acceptable day-to-day with an existing CI/CD, but it is handy in a critical situation. Since this is an almost guaranteed working approach, tasks for improvements usually live somewhere at the bottom of the backlog. Having a task or a TODO in the code is the best-case scenario; at worst, engineers will use the "unplug and plug back in" technique for any issue. Sticking to the hotfix practice raises risks to system stability and maintainability and creates potential code conflicts between hotfixes and the standardized codebase. As a result: more on-call issues.
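
For the sleep(N) example above, here is a hedged sketch of the hotfix next to a slightly more robust replacement. The previous_step_finished() check is hypothetical; the point is to poll for the real condition instead of guessing a duration.

```python
# The classic hotfix, and a replacement that waits for the actual condition.
# `previous_step_finished` is a hypothetical readiness check.

import time

def previous_step_finished() -> bool:
    """Placeholder for the real check (job status, marker file, etc.)."""
    return True

def wait_hotfix() -> None:
    time.sleep(30)   # "the previous method hasn't been completed yet"

def wait_properly(timeout_s: float = 60.0, poll_s: float = 1.0) -> None:
    """Poll for the real condition with an explicit timeout."""
    deadline = time.monotonic() + timeout_s
    while not previous_step_finished():
        if time.monotonic() > deadline:
            raise TimeoutError("previous step did not finish in time")
        time.sleep(poll_s)
```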

Practice case

The team developed a new version of the API and, under deadline pressure, decided to start using it without properly configuring DNS. Instead, dependent services were configured to call it by IP addresses. Tasks for replacing the hard-coded configurations with standard ones were planned but postponed, as often happens, due to other priorities. One day, the infra department did some maintenance; as a result, the IPs changed, and the services configured outside DNS became unavailable. It took time and effort to find the dependencies (yeah, it turned out that a few dependencies had changed owners, and the new team wasn't aware of the hotfix), reconfigure them, and restart the services. During this time, the product was unavailable to customers.

How to avoid

Remember the rule: each task must have some action to prevent it from recurring. Therefore, recording the hotfix usage as a comment on the on-call task or as a paragraph in the postmortem document is essential. Later, it'll be converted into a tech debt task whose status can be controlled. Always link tasks to keep context and to visualize the size of the issue if the hotfix is applied regularly. If you mark the code with a TODO, provide a link to the task next to it; untied TODOs often live forever in the codebase and don't lead to any solution.


If the task was created long ago but remains untouched, you need to review the prioritization policy. Sometimes it's difficult to negotiate such tasks with the leadership, since there's confidence that the 10-25% of total development time allocated for tech debt should cover such work. This is where on-call tracking becomes handy. Try to start tracking tasks whose postmortem follow-ups aren't resolved, and count the on-call tasks linked to the hotfix removal task. The metric will look much worse but be more truthful, and it will allow the leadership to make better estimations based on the technical condition of the product.


Regarding prioritization within a team, post-incident tasks should be given extremely high priority, just like security risk tasks. If it's impossible to eliminate the hotfix quickly, you need to evaluate the changes from a business perspective: affected revenue, infrastructure cost, cost of maintenance, and so on. As a result, you either convert the work into a project for the whole team or realize that the price of the changes is too high and describe the hotfix in the runbook as an acceptable solution.


6. You allow using personal accounts

Using personal accounts, even corporate ones, in the development pipeline rarely ends well. Accepting personal accounts outside the corporate domain creates additional security and data privacy risks. It also adds extra work for IT teams by increasing the complexity of monitoring activity and costs for 3P services and of deactivating employee accounts when a contract is terminated. Even if the company doesn't allow using personal accounts outside the system, you can still notice products and services linked to an engineer's account: a personal email used for server access, a DB user, a user for autotests, or project documentation stored in a private Google Drive. In an emergency, an on-call engineer has to improvise, contact the IT team to grant access, or be ready to mindlessly "change and redeploy" until the issue is resolved. If the team is distributed across continents, be prepared to wait for a while: once, I received an issue about data processing, didn't have access to the job-running tool, and had to put the task on hold for 8 hours until access was granted.

Practice case

The code analysis system found a "dead" endpoint, created a task for its automatic removal within two weeks, and notified an engineer to take action in case the endpoint was still needed. The notification rule was designed to send all notifications to one particular engineer, who eventually decided to leave the team. The team missed the notification, and the endpoint, which was not actually dead, was deleted. The on-call engineer had to deal with re-creating configs and reverting all the changes.

How to avoid

As obvious as “don’t allow personal accounts” may sound, the reality is much more complicated, especially for early-stage startups.


The first weak point is onboarding. On the first working day, a newcomer gets an empty laptop and the opportunity to install everything to their liking, and this is how you end up with unlicensed software and services registered to personal accounts. Pre-setup and SSO will help to reduce the effect. Engineers are often forced to create accounts just to glance at the product. To add a bit of control, make the first task for a new project (or one you've received from another team) the creation of all possible accounts and placeholders, even if they remain unused for a while. This could be email addresses for the team and test users, a documentation page in a shared space, and so on. In the future, the engineer will have much less need to re-create accounts.


If you still allow the use of personal accounts, make sure that all team members have extended permissions. For example, if your GitHub account has only one admin, the whole team will be blocked even on minor improvements like adding a webhook.


Introducing regular automatic checks will help deal with personal accounts that have already entered the system. Static code analyzers can help find account usage in the codebase; scheduled jobs can collect data from databases and flag accounts with unwanted settings (e.g., non-corporate email + test user + admin role). 3P services usually offer good analytics, so you can quickly get user data and restrict access if personal accounts are being used.
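
A minimal sketch of such a scheduled audit, assuming the user records come from a database query or a 3P service export; the domain, fields, and rule are illustrative.

```python
# A sketch of a scheduled audit that flags risky account combinations
# (non-corporate email + admin role, or a test user with admin rights).
# The records below are illustrative stand-ins for a DB query or 3P export.

CORPORATE_DOMAIN = "example.com"

users = [
    {"email": "alice@example.com",    "role": "admin", "is_test": False},
    {"email": "bob.dev@gmail.com",    "role": "admin", "is_test": False},
    {"email": "autotest@example.com", "role": "admin", "is_test": True},
]

def is_flagged(user: dict) -> bool:
    non_corporate = not user["email"].endswith("@" + CORPORATE_DOMAIN)
    return (non_corporate and user["role"] == "admin") or \
           (user["is_test"] and user["role"] == "admin")

for user in users:
    if is_flagged(user):
        print(f"review access for {user['email']}")   # or open a ticket automatically
```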

It is not always possible to get rid of unwanted accounts quickly: in extreme cases, essential flows and accesses can be tied to a single email. Organizing a separate event like a hackathon is helpful here, as you can bring in engineers from different teams and ensure the replacement won't break CI/CD in the middle of a release.

7. You don’t eliminate flakiness

Most of the time, flakiness shows up in alerts or tests, and the most annoying categories are "too many false positives" and "too-frequent notifications." Both lead to the same outcome: engineers simply stop paying attention and react to the issue later than they could.


Alerts are usually not documented and sometimes carry a description that doesn't explain their purpose. Too-frequent alerts are an example of over-engineering: with the good intention of making a system as safe as possible, engineers trigger alerts on every exception and set thresholds that are too low. Another way to make on-call worse is to raise the same alert in every available channel: send an email, create a task, use all the Slack options, etc.


As for tests, the number of QA engineers in a team is much smaller than the number of engineers, so tests are released later than features and become outdated relatively fast (the same applies to other team compositions such as "no QAs," "QA platform," etc.). Tests for UI components are at the highest risk of becoming flaky, especially if they're meant to be pixel-perfect. E2E tests for complex systems with authentication, long-lasting data processing, and stateful dependencies can become flaky due to timeouts. Tests themselves aren't the reason on-call gets worse. However, the capacity to create them usually comes from the same budget as tech debt tasks, and if they're poorly designed, they will distract on-call engineers more than they contribute to service resilience.

Practice case

That was one of the funniest cases in my practice. I was working on a feature and introduced a UI bug in a list container: the scrolling functionality was blocked, and only the visible set of items was available. I used a minimal setup during development, so I missed it. At that time, a few integration tests were flaky, as they relied on the find-and-click approach. The production environment had better resources than preproduction, so the latter faced flakiness much more often. When I ran the preproduction tests, they failed, but since it wasn't a unique case, I just notified the QA engineer and moved forward to production. When the production tests failed, the QA engineer started checking the tests themselves (hi, low service confidence), and we started receiving on-call tickets. Luckily, the revert was done quickly, but my colleagues still remind me about this case.

How to avoid

If the percentage of flakiness is high, consider switching off the alerts/tests. It might sound unbelievable, but if the outcome is a false positive 9 times out of 10, it distracts you more than it helps. Do a proper review and remove the reason for the flakiness before returning the check to the pipeline. For example, check that alerts have thresholds consistent with the service throughput, especially when the service has been upgraded and scaled. For tests, it's good to start by removing hardcoded timeout values or anything that could easily change (an object's position on the screen, the wording of an input label, data loaded in a collapsed component, etc.).
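
One way to decide what to switch off is to measure flakiness first. Below is a minimal sketch that computes a per-test failure rate over recent runs; the run history and the threshold are assumptions, and in practice the data would come from your CI system.

```python
# A sketch of measuring flakiness before deciding what to quarantine.
# The run history is an illustrative in-memory structure; in practice it
# would come from your CI system's build results.

from collections import defaultdict

# (test name, passed?) for the last N runs
run_history = [
    ("test_checkout_flow", False), ("test_checkout_flow", True),
    ("test_checkout_flow", False), ("test_login", True),
    ("test_login", True), ("test_login", True),
]

QUARANTINE_THRESHOLD = 0.3   # assumption: >30% failures over recent runs

stats: dict[str, list[int]] = defaultdict(lambda: [0, 0])   # [failures, total]
for name, passed in run_history:
    stats[name][1] += 1
    if not passed:
        stats[name][0] += 1

for name, (failures, total) in stats.items():
    rate = failures / total
    if rate > QUARANTINE_THRESHOLD:
        print(f"quarantine {name}: {rate:.0%} failure rate over {total} runs")
```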


To reduce noise and minimize the number of notification channels, I assign tasks to the on-call engineer for low-to-mid-priority cases, and tasks plus a DM for high-priority ones. You might keep a team channel for the most severe cases, but only if you expect somebody besides the on-call engineer to be involved.
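
A small sketch of that routing rule, with the notifier functions stubbed out since the exact tracker and chat APIs depend on your setup.

```python
# A sketch of keeping notification noise down: every task gets an owner,
# only high-priority tasks also get a DM. The notifiers are stubs.

def assign_task(task_id: str, engineer: str) -> None:
    print(f"assigned {task_id} to {engineer}")     # e.g., a tracker API call

def direct_message(engineer: str, text: str) -> None:
    print(f"DM to {engineer}: {text}")             # e.g., a chat API call

def notify(task_id: str, priority: str, on_call: str) -> None:
    assign_task(task_id, on_call)
    if priority in ("P0", "P1"):
        direct_message(on_call, f"High-priority on-call task {task_id}")
    # no team-channel spam unless more people than the on-call are needed

notify("ONCALL-123", "P1", "on-call-engineer")
```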


Many teams use tests as a service health check by running them continuously in production. If you rely on this approach, try focusing on the parts most likely to crash unexpectedly: a 3P service can become unavailable or change its response format, which can cause a system failure. Just remember that tests might be long-running and cover only core functionality; if you want a better result, consider building a scheduled job instead.
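
As an example of the scheduled-job alternative, here is a minimal sketch that checks a third-party dependency's availability and response shape. The URL and expected fields are hypothetical.

```python
# A sketch of a scheduled health check for a third-party dependency, as an
# alternative to running full E2E tests in production. The URL and the
# expected fields are hypothetical.

import json
import urllib.request

ENDPOINT = "https://partner.example.com/api/v1/rates"
EXPECTED_FIELDS = {"currency", "rate", "updated_at"}

def check_partner() -> list[str]:
    """Return a list of problems; empty means the dependency looks healthy."""
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=10) as response:
            payload = json.loads(response.read())
    except Exception as exc:   # report availability problems instead of crashing
        return [f"partner unavailable: {exc}"]
    keys = set(payload.keys()) if isinstance(payload, dict) else set()
    missing = EXPECTED_FIELDS - keys
    return [f"response format changed, missing: {sorted(missing)}"] if missing else []

if __name__ == "__main__":
    for problem in check_partner():
        print(problem)   # in practice: raise an alert for the on-call engineer
```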


Why it’s crucial to analyze the on-call process

After a short time of labeling on-call tasks, the first analysis helped identify that most of the load fell on small tasks (triage took less than an hour) that were closed without a fix or transferred to another team. After the tasks were grouped, we got a list of high-level issues (a sketch of this kind of grouping follows the list):


  • The incomplete transition of project ownership after a reorganization forced the automatic system to assign tasks to the wrong team.

  • Changing the product over the years without a UX review resulted in ambiguous notifications and error messages that confused users.

  • The lack of access to some internal tools deprived the support team of the opportunity to get detailed information and forced them to involve engineers.
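
As a sketch of the grouping referenced above, the snippet below counts tickets per label and the share of small tasks closed without a fix. The ticket records are illustrative; in practice they would be exported from your issue tracker.

```python
# A sketch of grouping labeled on-call tickets. The records are illustrative
# stand-ins for an export from an issue tracker.

from collections import Counter

tickets = [
    {"label": "wrong-team",    "triage_hours": 0.5, "fixed": False},
    {"label": "by-design",     "triage_hours": 0.3, "fixed": False},
    {"label": "wrong-team",    "triage_hours": 0.8, "fixed": False},
    {"label": "real-incident", "triage_hours": 6.0, "fixed": True},
]

by_label = Counter(t["label"] for t in tickets)
small_no_fix = sum(1 for t in tickets if t["triage_hours"] < 1 and not t["fixed"])

print("tickets per label:", dict(by_label))
print(f"small tasks closed without a fix: {small_no_fix}/{len(tickets)}")
```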


The team received enough information to develop solutions that not only solve the current problems but prevent similar ones in the future. For example, the code review system started automatically notifying the specialists working with content whenever it detected a new or changed text message. This allowed them to stop changes that weren't aligned with the content strategy from moving forward and, therefore, to avoid message inconsistency.


Conclusion

There is no magic pill to eliminate on-call, and I wouldn't say one is needed, as on-call activity shows that your product is live, evolving, and ready to address customer requests. However, it's essential to keep the balance between the time spent creating new features and maintaining the existing ecosystem. I hope you'll find techniques here that help make the on-call process less frightening and the product more resilient.


Also published here.