If you work in infrastructure technology, chances are you spend a lot of time working with IT operations teams. You’ve watched them pour a lot of hard work into meeting the expectations of the business, only to come away with limited success. The business continually bashes IT for providing poor service, while IT struggles to meet seemingly nebulous expectations with limited resources. The major problem here is the fundamental disconnect over how IT and the business each measure success.
IT is responsible for sharing limited resources (such as CPU, memory, and disk) between business functions, so they measure consumption. IT then uses those metrics to recognize when a resource is close to exhaustion, to avoid problems and keep costs low. The business, on the other hand, needs responsive and error-free services, so it measures success in terms of speed and quality. The result is two teams with drastically different definitions of success, and plenty of tension between IT and the business.
If you want a simpler and more responsive observability practice, tighter alignment with the business, and faster paths to improvement, you should focus on service-level metrics instead. In this article, I’ll introduce two metrics that should matter for your observability practice – service level indicators (SLIs) and service level objectives (SLOs) – and I’ll show you how to set your SLOs.
An SLI is a carefully defined quantitative indicator of some aspect of the level of service that is provided. In other words, an SLI is a metric measuring one thing that shows how well your IT service is performing. An SLI must be relevant to the delivered service and should be simple and easy to understand: when an SLI goes wrong, there must be some business impact, such as an outage or a poor user experience. Remember, the business expects speed and quality, so you need to choose SLIs (metrics) that measure those things, such as response time (latency), error rate, availability, and uptime.
Yes, there is a distinction between uptime (reliability) and availability (time lost to incidents). And here are some potential SLI choices that you shouldn’t use, because they don’t directly correlate to business impact: resource consumption metrics such as CPU utilization, memory usage, and disk consumption.
Again, the main difference between a good SLI and a bad one is the metric’s relevance to service delivery. A high error rate or a slow response time clearly affects service delivery. High CPU utilization might affect it too, but the relationship between CPU and service performance is much harder to establish. This is why IT teams that measure only resource consumption struggle to meet business expectations.
The key here is to pick a metric for your SLI that is clearly and unambiguously related to service delivery and is simple and easy to communicate to non-technical people. That will resolve the disconnect, making things easier for everyone involved.
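To make this concrete, here is a minimal sketch in Python showing how two service-level metrics like these might be computed from raw request data. The Request record, the 500-status error check, and the 95th-percentile choice are illustrative assumptions, not a prescribed implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    """One served request: an HTTP-style status code and latency in milliseconds."""
    status: int
    latency_ms: float

def availability_sli(requests: list[Request]) -> float:
    """Quality SLI: the fraction of requests served without a server-side error."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def p95_latency_sli(requests: list[Request]) -> float:
    """Speed SLI: the 95th-percentile response time (nearest-rank method)."""
    latencies = sorted(r.latency_ms for r in requests)
    index = math.ceil(0.95 * len(latencies)) - 1
    return latencies[index]

# Example: three fast successes and one slow server error.
sample = [Request(200, 120), Request(200, 95), Request(200, 140), Request(500, 900)]
print(f"Availability: {availability_sli(sample):.2%}")    # 75.00%
print(f"p95 latency:  {p95_latency_sli(sample):.0f} ms")  # 900 ms
```

Both numbers map directly to what a user experiences, which is exactly what makes them good SLI candidates.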
An SLO is simply a goal that you set for your SLIs. First, you identify your SLIs. Then, by setting thresholds for each SLI, you create your SLOs.
SLOs should be easy for even non-technical stakeholders to understand. Stand-alone resource consumption metrics, such as CPU utilization, don’t tell you if something is performing well or not—they require interpretation by an SME. Identifying business-impacting SLIs, setting SLOs, and properly presenting them means that the consumers of those SLOs don’t have to ask if the number is good or bad. Interpretation is intuitive—the answer is “good” or “not good.” As a bonus, it’s easy to use SLOs to measure improvement.
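As an illustration of how an SLO turns an SLI into an intuitive “good” or “not good” answer, here is a small sketch; the Slo class, the target values, and the names are hypothetical examples, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float           # the objective set for the SLI
    higher_is_better: bool   # True for availability, False for latency

    def evaluate(self, measured_sli: float) -> str:
        """Return a verdict a non-technical stakeholder can read at a glance."""
        met = (measured_sli >= self.target) if self.higher_is_better else (measured_sli <= self.target)
        return "good" if met else "not good"

# Hypothetical objectives for the SLIs sketched above.
availability_slo = Slo("Availability >= 99.9%", target=0.999, higher_is_better=True)
latency_slo = Slo("p95 latency <= 300 ms", target=300.0, higher_is_better=False)

print(availability_slo.name, "->", availability_slo.evaluate(0.9985))  # not good
print(latency_slo.name, "->", latency_slo.evaluate(240.0))             # good
```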
If the business or IT management has already set SLOs for you, then you’ll want to use those. If they haven’t, I recommend an iterative approach: measure each SLI for a while to establish a baseline, set an initial SLO that you can realistically meet today, then review and tighten the target over time as the service improves.
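Here is a rough sketch of that first baselining step, assuming you already have a few weeks of measured availability; the weekly window and the safety margin are illustrative choices, not fixed rules.

```python
def propose_initial_slo(weekly_availability: list[float], margin: float = 0.0005) -> float:
    """Propose an achievable first SLO: slightly below the worst observed week,
    so the initial target is realistic and can be tightened in later iterations."""
    baseline = min(weekly_availability)
    return round(baseline - margin, 4)

# Hypothetical availability measurements for one service over the last four weeks.
history = [0.9991, 0.9987, 0.9994, 0.9989]
print(f"Observed baseline:    {min(history):.4f}")               # 0.9987
print(f"Proposed initial SLO: {propose_initial_slo(history):.4f}")  # 0.9982
```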
Establishing SLIs and SLOs will result in a simpler and more responsive observability practice, tighter alignment with the business, and a faster path to improvement. It’s easy to get started: try this on one service and see how well it works.