Risk Tolerance of Services

by Stephen Thorne, March 23rd, 2017

How to decide how fault tolerant you really want to be, and how to define the value of reliability.

I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.

This is commentary on the second part of Chapter 3: Embracing Risk. Written by Marc Alvidrez, edited by Kavita Guliani.

Risk Tolerance of Services

What does it mean to identify the risk tolerance of a service? In a formal environment or in the case of safety-critical systems, the risk tolerance of services is typically built directly into the basic product or service definition. At Google, services’ risk tolerance tends to be less clearly defined.

Inside Google, we are getting better at identifying risk tolerance. A centralized dashboard showing system reliability thresholds has been very useful. It has the added benefit of being a “living document”: it shows the reliability goals of systems as well as their current behavior!

To identify the risk tolerance of a service, SREs must work with the product owners to turn a set of business goals into explicit objectives to which we can engineer. In this case, the business goals we’re concerned about have a direct impact on the performance and reliability of the service offered. In practice, this translation is easier said than done. While consumer services often have clear product owners, it is unusual for infrastructure services (e.g., storage systems or a general-purpose HTTP caching layer) to have a similar structure of product ownership. We’ll discuss the consumer and infrastructure cases in turn.

Easier said than done: such an understatement. How do you reconcile the business wanting 100% reliability with the realities of day-to-day failure? That’s what this is about.

Identifying the Risk Tolerance of Consumer Services

Our consumer services often have a product team that acts as the business owner for an application. For example, Search, Google Maps, and Google Docs each have their own product managers. These product managers are charged with understanding the users and the business, and for shaping the product for success in the marketplace. When a product team exists, that team is usually the best resource to discuss the reliability requirements for a service. In the absence of a dedicated product team, the engineers building the system often play this role either knowingly or unknowingly.

If the idea of employees doing something that’s not their job sounds familiar, that’s because it’s rife throughout our whole industry: backend engineers fixing JavaScript bugs, QA engineers doing requirements gathering, salespeople ending up as scrum masters.

These aren’t necessarily dysfunctions (not in the way the development/devops/SRE dysfunctions were discussed earlier in the book), but whether they work out does depend on the skill set of the individuals.

Don’t be afraid to approach this procedure just because your job title is wrong: You can be objective and work out what reliability requirements are necessary. Make sure you document your rationale.

There are many factors to consider when assessing the risk tolerance of services, such as the following:

  • What level of availability is required?
  • Do different types of failures have different effects on the service?
  • How can we use the service cost to help locate a service on the risk continuum?
  • What other service metrics are important to take into account?

Each of these points will be addressed below:

Target level of availability

The target level of availability for a given Google service usually depends on the function it provides and how the service is positioned in the marketplace. The following list includes issues to consider:

  • What level of service will the users expect?
  • Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?
  • Is this a paid service, or is it free?
  • If there are competitors in the marketplace, what level of service do those competitors provide?
  • Is this service targeted at consumers, or at enterprises?

There are two opposite ends of a spectrum. At one end is “the system must always be up: 100% reliability”, which we know is unworkable; if we see this attitude we need to try to adjust it.

At the other end is complacency: A refusal to even measure how available the system is. This is worse than having a large error budget: it’s having an unknowable error budget.

So measuring the right thing is the first step; working out what our users expect on that scale is step two.
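
To make “measuring the right thing” a little more concrete, here is a minimal sketch of a request-based availability measurement. The function name and the request counts are my own, purely for illustration; a real system would pull these counters from monitoring data.

```python
# A minimal sketch of a request-based availability measurement
# (hypothetical numbers; a real system would aggregate these from monitoring).

def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # No traffic in the window: nothing failed.
    return successful_requests / total_requests

# Example: 999,052 successes out of 1,000,000 requests in the window.
print(f"{availability(999_052, 1_000_000):.4%}")  # -> 99.9052%
```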

Consider the requirements of Google Apps for Work. The majority of its users are enterprise users, some large and some small. These enterprises depend on Google Apps for Work services (e.g., Gmail, Calendar, Drive, Docs) to provide tools that enable their employees to perform their daily work. Stated another way, an outage for a Google Apps for Work service is an outage not only for Google, but also for all the enterprises that critically depend on us. For a typical Google Apps for Work service, we might set an external quarterly availability target of 99.9%, and back this target with a stronger internal availability target and a contract that stipulates penalties if we fail to deliver to the external target.

This raises an interesting issue, and in this article and in future ones I will use these acronyms consistently:

I define an SLA, or “Service Level Agreement”, to be an “external availability target with a contract that stipulates penalties if we fail to deliver to the external target”.

And an SLO, or “Service Level Objective”, to be an “internal availability target”, with no contractual obligations.

Neither of these can exist without an SLI, or “Service Level Indicator”, which is the metric that defines success or failure.
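
As a rough sketch of how I think about the relationship between the three terms (this is my own illustration, not a real Google configuration format, and the thresholds below are hypothetical):

```python
# A rough sketch of the SLI/SLO/SLA relationship. Not a real config format;
# the thresholds are hypothetical.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ServiceLevel:
    sli: str              # SLI: the metric that defines success or failure.
    slo: float            # SLO: internal availability target, no contract.
    sla: Optional[float]  # SLA: external target backed by contractual penalties.

# Modelled loosely on the Google Apps for Work example below: a 99.9%
# external SLA backed by a stronger (hypothetical) internal SLO.
apps_for_work = ServiceLevel(
    sli="fraction of requests served successfully, measured quarterly",
    slo=0.9995,
    sla=0.999,
)
```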

YouTube provides a contrasting set of considerations. When Google acquired YouTube, we had to decide on the appropriate availability target for the website. In 2006, YouTube was focused on consumers and was in a very different phase of its business lifecycle than Google was at the time. While YouTube already had a great product, it was still changing and growing rapidly. We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.

And now YouTube is such a massive product. While we probably don’t have very interesting SLAs on YouTube uptime, we would certainly have reasonably strict SLOs internally to make sure we’re pushing videos (and Ads!) to our users reliably.

Types of failures

The expected shape of failures for a given service is another important consideration. How resilient is our business to service downtime? Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.
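
A back-of-the-envelope illustration (my own hypothetical numbers, not the book’s): a service handling ten million requests a day accumulates the same number of errors from a constant 0.05% failure rate as it does from a full outage lasting well under a minute.

```python
# Hypothetical numbers showing how a constant low error rate and a short
# full-site outage can produce the same absolute number of errors.

requests_per_day = 10_000_000
requests_per_second = requests_per_day / 86_400  # ~116 rps, assuming even traffic

# Shape 1: a constant 0.05% of requests fail, spread across the whole day.
trickle_errors = requests_per_day * 0.0005  # 5,000 errors

# Shape 2: a 100% outage lasting just long enough to match that error count.
outage_seconds = trickle_errors / requests_per_second  # ~43 seconds

print(f"Trickle: {trickle_errors:.0f} errors over 24 hours")
print(f"Outage:  {trickle_errors:.0f} errors in about {outage_seconds:.0f} seconds")
```

Same error count; whether users barely notice or see a hard outage depends entirely on the shape.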

An illustrative example of the difference between full and partial outages naturally arises in systems that serve private information. Consider a contact management application, and the difference between intermittent failures that cause profile pictures to fail to render, versus a failure case that results in a user’s private contacts being shown to another user. The first case is clearly a poor user experience, and SREs would work to remediate the problem quickly. In the second case, however, the risk of exposing private data could easily undermine basic user trust in a significant way. As a result, taking down the service entirely would be appropriate during the debugging and potential clean-up phase for the second case.

Privacy incidents are no joke.

At the other end of services offered by Google, it is sometimes acceptable to have regular outages during maintenance windows. A number of years ago, the Ads Frontend used to be one such service. It is used by advertisers and website publishers to set up, configure, run, and monitor their advertising campaigns. Because most of this work takes place during normal business hours, we determined that occasional, regular, scheduled outages in the form of maintenance windows would be acceptable, and we counted these scheduled outages as planned downtime, not unplanned downtime.

I used to run Ads Frontend! That was my first SRE team. I joined at the tail end of the era of these planned outages. Let me flesh this out a bit:

The maintenance being conducted here was moving where the database master was running, either so we could prove that it could be moved, or so that datacenter maintenance could be done.

Note: this was a MySQL database, with one authoritative master and geographically distributed read-only replicas.

By late 2011, the master failover procedure had gotten so fast that instead of the outage taking an hour and being conducted as a planned outage (i.e. disable the system and replace it with a ‘sorry’ page, move the master, bring the system back), we could just move the location of the master and suffer elevated error rates and latency for five or so minutes, after which everything went back to normal.

Most of the errors were on ‘write’ operations because reads were always done with the database replicas, not the current write-master.

Once we moved to the newer procedure, we started counting all failures and latency during these events against our error budget. Our users were happier because in aggregate the system was up for longer, and web systems can be quite resilient to transient back-end errors.

Cost

Cost is often the key factor in determining the appropriate availability target for a service. Ads is in a particularly good position to make this trade-off because request successes and failures can be directly translated into revenue gained or lost. In determining the availability target for each service, we ask questions such as:

  • If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?
  • Does this additional revenue offset the cost of reaching that level of reliability?

So simple. Translate reliability into numbers, compare those numbers. So complex to get right, except with the simplest systems.

To make this trade-off equation more concrete, consider the following cost/benefit for an example service where each request has equal value:

  • Proposed improvement in availability target: 99.9% → 99.99%
  • Proposed increase in availability: 0.09%
  • Service revenue: $1M
  • Value of improved availability: $1M * 0.0009 = $900

To assist with the thought exercise: 99.9% is roughly 8 hours of downtime a year, 40 minutes a month, or 1 in every 1,000 user actions being an error.

In this case, if the cost of improving availability by one nine is less than $900, it is worth the investment. If the cost is greater than $900, the costs will exceed the projected increase in revenue.
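
Here is the same arithmetic as a small sketch, reproducing the book’s $900 figure and the downtime conversion from the thought exercise above. The $1M revenue is the book’s hypothetical; the rest is straightforward unit conversion.

```python
# Reproduces the book's cost/benefit arithmetic and the availability-to-
# downtime conversion from the thought exercise above.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def value_of_improvement(revenue: float, current: float, proposed: float) -> float:
    """Extra revenue captured by the additional successful requests,
    assuming every request has equal value (as in the book's example)."""
    return revenue * (proposed - current)

def downtime_minutes_per_year(availability: float) -> float:
    """Downtime allowed per year at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

print(f"${value_of_improvement(1_000_000, 0.999, 0.9999):,.0f}")     # -> $900
print(f"{downtime_minutes_per_year(0.999):.0f} min/year at 99.9%")   # -> 526 min/year (~44 min/month)
print(f"{downtime_minutes_per_year(0.9999):.0f} min/year at 99.99%") # -> 53 min/year (~4.4 min/month)
```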

If you think you can spend $900 and bring a non-trivial user facing system from an expected 40 minutes of downtime a month to 4 minutes a month, I am very interested, please leave a comment. And your résumé.

It may be harder to set these targets when we do not have a simple translation function between reliability and revenue. One useful strategy may be to consider the background error rate of ISPs on the Internet. If failures are being measured from the end-user perspective and it is possible to drive the error rate for the service below the background error rate, those errors will fall within the noise for a given user’s Internet connection. While there are significant differences between ISPs and protocols (e.g., TCP versus UDP, IPv4 versus IPv6), we’ve measured the typical background error rate for ISPs as falling between 0.01% and 1%.

Other service metrics

Examining the risk tolerance of services in relation to metrics besides availability is often fruitful. Understanding which metrics are important and which metrics aren’t important provides us with degrees of freedom when attempting to take thoughtful risks.

I don’t think this chapter has adequately explained and defined what ‘Availability’ means. But I think that’s fine, considering it takes me months of meetings to get a room full of people to agree on what it means.

You could likely write entire books on the topic of defining Service Level Indicators.

Service latency for our Ads systems provides an illustrative example. When Google first launched Web Search, one of the service’s key distinguishing features was speed. When we introduced AdWords, which displays advertisements next to search results, a key requirement of the system was that the ads should not slow down the search experience. This requirement has driven the engineering goals in each generation of AdWords systems and is treated as an invariant.

When you think about the fact that AdWords was backed by a MySQL database for over a decade, you have to stop and think about the engineering effort that went into this.

Clearly serving AdWords ads directly out of the MySQL database is not a thing that will scale to the number of users that Google Search has.

AdSense, Google’s ads system that serves contextual ads in response to requests from JavaScript code that publishers insert into their websites, has a very different latency goal. The latency goal for AdSense is to avoid slowing down the rendering of the third-party page when inserting contextual ads. The specific latency target, then, is dependent on the speed at which a given publisher’s page renders. This means that AdSense ads can generally be served hundreds of milliseconds slower than AdWords ads.

Not to say that AdSense is slow: hundreds of milliseconds is actually very fast for page asset loading. And these latency goals were originally set when the 56k modem was still in widespread use.

This is just a useful comparison of how a user might still have a ‘good’ experience even though one system is much slower than the other, both in target and in actual performance.

This looser serving latency requirement has allowed us to make many smart trade-offs in provisioning (i.e., determining the quantity and locations of serving resources we use), which save us substantial cost over naive provisioning. In other words, given the relative insensitivity of the AdSense service to moderate changes in latency performance, we are able to consolidate serving into fewer geographical locations, reducing our operational overhead.

When users are happy with slower or delayed responses, or with transient, invisible errors, you can make a good argument for much looser error budgets.

Identifying the Risk Tolerance of Infrastructure Services

As I personally didn’t have a great deal to say about the risk tolerance of infrastructure services, I’ve elided that section; please click through to the SRE Book site to read the section about estimating risk for a service like Bigtable.

Coming next

The last part of Chapter 3: Embracing Risk, titled “Motivation for Error Budgets”, where we discuss how error budgets are used.

You can see all my posts in order here.