
Achieving Optimal Service Reliability: Insights Into Service Level Objectives (SLOs)

by Daniil Mazepin, October 9th, 2024

Too Long; Didn't Read

Service Level Objectives (SLOs) are crucial in balancing service reliability with innovation in software engineering. This article explains the importance of SLOs, key terminology, and practical steps for implementation, emphasizing the need for realistic and achievable goals that align with user expectations and business objectives.

Hey there! My name is Daniil, and today, I want to share some thoughts on reliability. I’m sure all of you know what reliability means, but I’d like to emphasize one particular angle – reliability from the end users’ perspective. I’m also certain many of you have faced the challenge of prioritizing between new features driven by business needs and technical excellence driven by engineers. So, can SLOs help us tackle both of these challenges?


Before we dive in, let me share a bit about myself and why I’m so passionate about this topic. I’m currently an Engineering Manager at Teya, a fintech startup whose mission is to empower small businesses across Europe with the best financial platform. I support backend teams on the Acquiring side of the business, where, as you might expect, reliability is crucial. Prior to Teya, I worked at Meta (formerly Facebook), where I also supported backend teams as we prepared to launch a new e-commerce platform across the Family of Apps. Ensuring everything was up to the highest standard for the launch was a key focus, and this is where my passion for reliability truly developed.


Now, let’s get back to the topic, and I’d like to start by covering the key terms.

Key Terminology

  • Reliability: The system’s ability to perform as expected when required. Reliability is not just about uptime; it’s about meeting user expectations. If your system quickly returns errors, users still won’t be satisfied. Therefore, user happiness should be at the core of reliability considerations; after all, we build products for users, don’t we? But how can we measure it?


  • Service Level Indicator (SLI): A quantifiable measure of service reliability, reflecting how well the service is performing in a specific aspect, such as latency. This metric should closely correlate with user satisfaction. Think of it as the percentage of “good” events out of the total “valid” events.


  • Service Level Objective (SLO): The target value for an SLI, setting the goal for acceptable service performance. For example, an SLO might state that availability must exceed 99.9% (three nines) or that 99% of requests must have a latency below 3 seconds. Here, 99% is the SLO and 3 seconds is the threshold. A correctly defined SLO should set a clear boundary: meet it, and end users are happy; consistently violate it, and users will complain or even stop using the service.


  • Service Level Agreement (SLA): A legally binding contract between a service provider and its customers, specifying the penalties if an SLO is not met. While important, SLAs are more about legal obligations than day-to-day operations in Site Reliability Engineering (SRE).


  • Error Budget: The amount of unreliability that is acceptable, calculated as 100% minus the SLO. Read that again: SLOs imply accepted unreliability. For example, if your availability SLO is 99%, then 1% of requests are allowed to fail within your Error Budget. This budget represents the risk threshold for service degradation and can be spent on experimentation or innovation. We are free to use this budget; in fact, we must. Ideally, the error budget should be fully utilized within a set time frame.


  • Error Budget Window: The time range over which SLOs and Error Budgets are measured. This could be fixed (e.g., per week or per month) or rolling (e.g., the last 30 days). A fixed window is useful for internal reporting, while a rolling window better reflects ongoing user experience, as users’ trust doesn’t magically recover on the first day of each month. To properly define Error Budget Windows, you need to understand user behavior patterns. For example, if the vast majority of users interact with the service only a few times per month, particularly at the end of the month to generate reports, then a 7-day rolling window would not be appropriate.


  • Burn Rate: The rate at which the error budget is consumed. A burn rate of 1 indicates that the error budget will be fully used exactly by the end of the window, which is ideal. If the burn rate is below 1, you have more room for experimentation. If it’s above 1, you risk depleting your error budget before the window ends, which could lead to service issues. The sketch below ties these terms together.
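
To make these definitions concrete, here is a minimal Python sketch that ties them together. The request counts are invented for illustration, and the 30-day rolling window is just one possible choice:

```python
# A minimal sketch tying SLI, SLO, error budget, and burn rate together.
# All numbers below are made up for illustration.

SLO = 0.99                  # target: 99% of valid requests succeed
WINDOW_DAYS = 30            # rolling error budget window

valid_requests = 1_000_000  # total "valid" events in the window
good_requests = 992_000     # events that met the success criteria

sli = good_requests / valid_requests          # 0.992
error_budget = 1 - SLO                        # 0.01 -> 1% of requests may fail
budget_consumed = (1 - sli) / error_budget    # fraction of the budget spent

# Burn rate: budget consumption relative to how much of the window has
# elapsed. Exactly 1 means the budget runs out precisely at window's end;
# above 1 means it will run out early.
elapsed_days = 20
burn_rate = budget_consumed / (elapsed_days / WINDOW_DAYS)

print(f"SLI: {sli:.3%}, budget consumed: {budget_consumed:.0%}, "
      f"burn rate: {burn_rate:.2f}")
```

Here, 80% of the budget is gone after two-thirds of the window, giving a burn rate of 1.2: sustainable for a while, but worth watching.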

Why SLOs?

SLOs are crucial for making data-driven decisions that balance technical excellence with product increments. Properly defined SLOs reflect user happiness; consistently violating them means you’re continuously disappointing users. This makes SLOs a powerful tool for prioritization, ensuring that critical issues are addressed before less impactful enhancements. Obviously, nobody needs a new feature on the thirty-fifth screen in the app if the most-used feature takes ages to run.


The beauty of properly defined SLOs is that they stop being purely technical terms and start speaking a language the business understands. In the past, just by having the right SLOs, I managed to reshuffle fully packed roadmaps multiple times to prioritize performance and reliability work, which is otherwise very difficult to prioritize because its business value isn’t directly visible.


Moreover, adopting an SLO framework across all teams in an organization ensures everyone is aligned. When all upstream and downstream dependencies use SLOs, it helps teams understand what commitments they can make, as a service can’t be better than its upstream dependencies. It also keeps teams accountable. If one service fails to meet its SLOs due to an external dependency, it still impacts the overall user experience, underscoring the interconnected nature of service reliability.

Where to Start?

Okay, I hope by this point you’ve already decided that SLOs will be your next priority. Great decision! But where to start? What should you do? There are so many options for defining an SLO, but ironically, very few will actually do the job.


Basically, to get on board with SLOs, you need just three things (well, maybe four): a metric, a target value, a window, and, sometimes, a threshold. So, how do you pick them?
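
As an illustration, here is one way to express those three-or-four things in code. The structure, field names, and example values are my own, not a standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SloSpec:
    metric: str                        # the SLI, e.g. "availability" or "latency"
    target: float                      # the SLO target, e.g. 0.999
    window_days: int                   # error budget window, e.g. a rolling 30 days
    threshold: Optional[float] = None  # only needed for metrics like latency

# An availability SLO needs no threshold; a latency SLO does.
availability_slo = SloSpec(metric="availability", target=0.999, window_days=30)
latency_slo = SloSpec(metric="latency", target=0.99, window_days=30, threshold=3.0)
```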

Picking the Metric

Choosing the right metrics to monitor is the first step in implementing SLOs. Prioritize metrics that:

  • Directly Reflect User Happiness: Metrics should have a clear, linear relationship with customer satisfaction. This is the cornerstone of the whole exercise. What are the most important features the service provides for users? In the payments world, it’s the ability to make a transaction quickly and reliably. In the e-commerce world, it’s the ability to buy a product. And so on and so forth.


    As you can see, the prerequisite for this step is to make sure you understand the business and its clients. What do they do, and what do they want? Once you know the answers to these questions, picking the right metric will be easy. You wouldn’t even consider using CPU usage, for example, because you’ll know that this metric by itself has almost zero impact on the user experience. However, latency for certain endpoints might have a very high impact.


  • Correlate with Outages: Choose metrics that reflect service issues, not ones that remain stable during outages.


  • Provide Clear Signals: Avoid metrics with high noise; the data should offer actionable insights.


There are a few common metrics that might be a good starting point depending on the type of service you have:

Request-driven Services:

  • Availability: The fraction of valid requests successfully served. This measures system uptime and responsiveness.
  • Latency: The fraction of requests served within a specific time threshold, indicating system speed.
  • Quality: The fraction of requests served without degradation, ensuring consistent service quality.


Data Processing Services:

  • Coverage: The percentage of data processed, ensuring completeness.
  • Correctness: The percentage of output data deemed correct, verifying accuracy.
  • Freshness: The recency of the source or output data, expressed as a fraction, indicating how up-to-date the data is.
  • Throughput: The fraction of time when the processing rate exceeds a threshold, measuring efficiency.


For request-driven services, availability and latency are usually a safe bet if you’re just getting started. In most cases, we start with these, so I’d recommend you do the same. With data processing services, it’s a bit more complicated, as it depends heavily on the nature of the service and its usage. So here, you’ll need to assess it yourself. What’s important is to not pick all of them from day one. Select one, a maximum of two, and start down that route. “The man on top of the mountain didn't fall there.” © Vince Lombardi. Start small, and then iterate.
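
To illustrate the latency SLI from the list above, here is a small sketch. The durations are invented, and the 3-second threshold simply mirrors the earlier SLO example:

```python
# Latency SLI: the fraction of valid requests served within a threshold.
# Durations are invented; the 3 s threshold mirrors the earlier example.

THRESHOLD_SECONDS = 3.0

durations = [0.4, 1.2, 0.9, 3.5, 0.7, 2.8, 4.1, 0.3, 1.1, 0.6]

fast_enough = sum(1 for d in durations if d <= THRESHOLD_SECONDS)
latency_sli = fast_enough / len(durations)

print(f"Latency SLI: {latency_sli:.0%}")  # 80% of requests served under 3 s
```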


Another interesting question is how to define “valid” requests and “success” for availability, for example. As a starting point, counting the share of 5xx errors among all requests is a good approach. But as you iterate, you’ll unpack many more interesting questions about the nature of the errors. Just please, don’t overcomplicate things from the beginning.
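
A minimal sketch of that starting point, treating 5xx responses as failures and 4xx responses as valid but user-caused; that split is one common, though debatable, first choice:

```python
# Availability as "non-5xx responses out of all valid requests".
# Counting 4xx as valid-but-not-a-server-failure is one common starting
# point; as you iterate, you may exclude health checks, retries, etc.

status_codes = [200, 200, 404, 500, 200, 503, 201, 200, 429, 200]

valid = [code for code in status_codes if code < 600]  # everything, for now
good = [code for code in valid if code < 500]          # anything but 5xx

availability_sli = len(good) / len(valid)
print(f"Availability SLI: {availability_sli:.0%}")  # 80%
```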

Defining the Target Value

The concept of reliability often revolves around the number of "9s" (e.g., 99.9%, 99.99%). However, aiming for 100% reliability is impractical and typically incorrect. As Ben Treynor Sloss, founder of Site Reliability Engineering (SRE) at Google, stated, "100% is the wrong reliability target for basically everything."


Instead, focus on setting realistic and achievable SLOs that align with user expectations and business goals. How do you find this ideal value? You won’t be surprised to hear the same advice as in the previous section—iterate! “Picking the wrong number is better than picking no number.” © SRE.Google.


Start by familiarizing yourself with the table that converts a specific number of "9s" into minutes of unreliability per day, week, or month. This will give you a practical understanding of what’s possible and what’s not. For instance, anything below 10 minutes of downtime is usually achievable only with automation and no human intervention.
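
If you don’t have that table at hand, it’s easy to derive yourself; a quick sketch:

```python
# Convert an availability target into the downtime it allows per window.
# (1 - target) is the error budget as a fraction of the window.

WINDOWS = {"day": 24 * 60, "week": 7 * 24 * 60, "month": 30 * 24 * 60}  # minutes

for target in (0.99, 0.999, 0.9999):
    budget = 1 - target
    allowed = ", ".join(f"{name}: {minutes * budget:.1f} min"
                        for name, minutes in WINDOWS.items())
    print(f"{target:.2%} -> {allowed}")
```

Running this shows, for example, that three nines allow roughly 1.4 minutes of downtime per day and about 43 minutes per 30-day month, which is exactly where the “automation, no human intervention” point above starts to bite.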


If you have historical data, use it to define your initial target. If you don’t, begin with two or three nines, depending on your service's maturity. And remember to iterate as you gather more data and insights.
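
One way to turn historical data into an initial target, assuming you can export daily availability numbers from your monitoring system (the data below is invented), is to pick a level you already meet on most days and tighten it later:

```python
# Derive an initial SLO target from historical daily availability.
# Idea: start from a target the service already meets on ~90% of days,
# then tighten it as the service matures. The data below is invented.

daily_availability = [0.9991, 0.9987, 0.9995, 0.9978, 0.9993,
                      0.9989, 0.9996, 0.9984, 0.9992, 0.9990]

sorted_days = sorted(daily_availability)
initial_target = sorted_days[int(0.1 * len(sorted_days))]

print(f"Suggested initial SLO: {initial_target:.2%}")  # met on 9 of 10 days
```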


Also, keep in mind that adding another "9" often comes at a steep cost, with the price of additional reliability increasing almost exponentially. It’s always a good idea to assess whether that cost is justified by the value it brings to the business.

Summary

In the realm of software engineering, striking the right balance between service reliability and rapid innovation is a challenging task. Service Level Objectives (SLOs) are essential for managing this tension. By defining and measuring SLOs, organizations can ensure that the push for fast development does not compromise the reliability of their services. SLOs provide a structured framework to maintain reliability while still prioritizing user satisfaction and allowing room for innovation.


Setting SLOs is an iterative process. It’s better to start with an imperfect target and refine it over time than to avoid setting a goal altogether. This approach allows teams to gather data, learn from experience, and make incremental improvements.

A Practical Plan:

  1. Identify Critical User Journeys: Focus on the most important user interactions with your service. Determine the key areas where reliability is crucial.


  2. Define SLIs: Choose metrics that best represent the reliability of these critical journeys. Ensure they are measurable and meaningful.


  3. Set SLOs: Establish realistic targets for the chosen SLIs, considering user expectations and business goals.


  4. Monitor and Iterate: Continuously track SLIs and compare them against the SLOs. Use this data to make informed decisions and drive improvements (see the burn-rate sketch after this list).


  5. Communicate and Align: Ensure all stakeholders understand the SLOs and their significance. Align the organization's efforts towards achieving these objectives.
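
For step 4, a common pattern, popularized by Google’s SRE Workbook, is to alert on burn rate rather than on raw error counts. A simplified sketch follows; the multi-window thresholds are illustrative, not prescriptive:

```python
# Simplified burn-rate alerting for step 4 ("Monitor and Iterate").
# The multi-window approach follows the spirit of Google's SRE Workbook,
# but the exact numbers here are illustrative, not prescriptive.

SLO = 0.999

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget burns at this error ratio (1.0 = on pace)."""
    return error_ratio / (1 - SLO)

def should_page(error_ratio_1h: float, error_ratio_6h: float) -> bool:
    # Page only if both a short and a long window burn fast: this filters
    # out brief blips while still catching sustained budget consumption.
    return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_6h) > 14.4

print(should_page(error_ratio_1h=0.02, error_ratio_6h=0.016))    # True
print(should_page(error_ratio_1h=0.02, error_ratio_6h=0.0005))   # False
```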


Let’s build software that is reliable in the eyes of end users, not just one that shows green charts in our dashboards.