Last year, the Uber Engineering team published an article on their new load-shedding mechanism designed for their microservice architecture.
The article is interesting from several perspectives, so I took notes as I read to capture my understanding and to write down the things I'd like to dig into later if I didn't get answers by the end. I've found, time and again, that this is the best way for me to learn something new.
What got me from the start was the reference to century-old principles used to build this solution. That’s something I love — borrowing concepts/ideas from different fields and adapting them to solve a problem in a different domain.
If system resiliency and stability are of interest to you, I recommend also reading the excellent book ‘Release It!’ by Michael T. Nygard.
It’s an oldie but a goodie — a book that delves into strategies, patterns, and practical guidance for building resilient and stable software systems, emphasizing how to handle failures effectively.
Uber has implemented a new load-shedding solution called Cinnamon that takes advantage of a PID controller (the century-old mechanism) to decide which requests should be processed or discarded by a service, based on the service’s current load and the request priority.
It does not involve any tuning at the service level (although I had a question about that), is automatically adaptable, and is much more efficient than their previous solution, QALM. Remember also that Uber’s microservices architecture is not for the faint of heart…
A PID controller is an instrument used in industrial control applications to regulate temperature, flow, pressure, speed, and other process variables. PID (proportional-integral-derivative) controllers use a control-loop feedback mechanism to control process variables and are among the most accurate and stable types of controller.
If you want more information about this century-old concept, head to Wikipedia.
Now, back to the article. PID stands for Proportional, Integral, and Derivative. In their case, a PID controller monitors the health of a service (its input requests) based on these three components (or measures).
The term “proportional” indicates that the action taken is proportional to the current error. In simple terms, this means that the correction applied is directly proportional to the difference between the desired state and the actual state. If the error is large, the corrective action is proportionally large.
“When an endpoint is overloaded, the background goroutine starts to monitor the inflow and outflow of requests into the priority queue.”
So, the Proportional (P) component in the load shedder adjusts the shedding rate based on how far the current queue size is from the target or desired queue size. If the queue is larger than desired, more shedding occurs; if it’s smaller, shedding is reduced.
That’s my understanding of it.
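To make that concrete, here is a minimal sketch of a purely proportional shed ratio. This is my own illustration, not Cinnamon’s code; the gain and target queue size are made-up values.

```go
package main

import "fmt"

// proportionalShedRatio returns a shed ratio in [0, 1] that grows linearly
// with how far the current queue size is above the desired queue size.
// kp and targetQueueLen are illustrative values, not Uber's.
func proportionalShedRatio(queueLen, targetQueueLen int, kp float64) float64 {
	err := float64(queueLen - targetQueueLen) // positive when the queue is too long
	ratio := kp * err
	if ratio < 0 {
		return 0 // queue smaller than target: no shedding
	}
	if ratio > 1 {
		return 1 // cap at shedding everything
	}
	return ratio
}

func main() {
	// With kp = 0.01 and a target of 50 queued requests, a queue of 150
	// yields a shed ratio of 1.0, a queue of 80 yields 0.3, and 40 yields 0.
	for _, q := range []int{150, 80, 40} {
		fmt.Printf("queue=%d -> shed %.2f\n", q, proportionalShedRatio(q, 50, 0.01))
	}
}
```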
“The PID controller’s job is to minimize the number of queued requests, whereas the auto-tuner’s job is to maximize the throughput of the service, without sacrificing the response latencies (too much).”
While the text doesn’t explicitly mention “Integral (I)” in the context of the queue size, it indicates that the PID controller’s role is to minimize the number of queued requests. The minimization of queued requests aligns with the Integral component’s goal of addressing accumulated errors over time.
“To determine whether an endpoint is overloaded we keep track of the last time the request queue was empty, and if it has not been emptied in the last say 10 seconds we deem the endpoint to be overloaded (inspired by Facebook).”
In the load shedder, it may be associated with decisions related to the historical behavior of the request queue, such as the time since it was last empty.
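A rough sketch of that “last time the queue was empty” check might look like the following. The 10-second threshold comes from the quote above; everything else (names, structure) is my guess, not Uber’s implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// overloadDetector deems an endpoint overloaded if its request queue
// has not been empty for longer than the configured threshold.
type overloadDetector struct {
	mu        sync.Mutex
	lastEmpty time.Time
	threshold time.Duration // e.g. 10 * time.Second, per the quote above
}

// observeQueueLen would be called by a monitoring goroutine on every tick.
func (d *overloadDetector) observeQueueLen(queueLen int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if queueLen == 0 {
		d.lastEmpty = time.Now()
	}
}

// overloaded reports whether the queue has stayed non-empty for too long.
func (d *overloadDetector) overloaded() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return time.Since(d.lastEmpty) > d.threshold
}

func main() {
	d := &overloadDetector{lastEmpty: time.Now(), threshold: 10 * time.Second}
	d.observeQueueLen(37) // queue is non-empty, so lastEmpty is not refreshed
	fmt.Println("overloaded?", d.overloaded())
}
```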
Honestly, that’s not entirely clear to me. It’s a bit frustrating, I must say. While they mention leveraging a century-old mechanism, it would have been helpful if they had explicitly stated which part corresponds to what, or how it operates. I don’t want to diminish the value of their amazing article; that’s just my rant… After all, I’m French… ;)
I think this one is easier to identify.
In a classical PID (Proportional-Integral-Derivative) controller, the “Derivative (D)” action is particularly useful when you want the controller to anticipate the future behavior of the system based on the current rate of change of the error. It helps dampen oscillations and improve the system’s stability.
In the context of the load shedder and the PID controller mentioned in the article, the Derivative component is likely employed to assess how fast the request queue is filling up. By doing so, it aids in making decisions that aim to maintain a stable system and prevent sudden or unpredictable changes.
“The rejector component has two responsibilities: a) figure out whether an endpoint is overloaded and b), if an endpoint is overloaded, shed a percentage of the requests to make sure the request queue is as small as possible. When an endpoint is overloaded, the background goroutine starts to monitor the inflow and outflow of requests into the priority queue. Based on these numbers, it uses a PID controller to determine a ratio of requests to shed. The PID controller is very fast (as in very few iterations are needed) at finding the correct level and once the request queue has been drained, the PID ensures that we only slowly reduce the ratio.”
In the mentioned context, the PID controller is used to determine the ratio of requests to shed when an endpoint is overloaded, and it monitors the inflow and outflow of requests. The derivative component of the PID controller, which responds to the rate of change, is implicitly involved in assessing how fast the request queue is being filled or drained. This helps in making dynamic decisions to maintain system stability.
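My reading of the derivative part is that it boils down to watching how fast the queue grows, i.e. inflow minus outflow per monitoring tick. A small sketch of that measurement, again my own illustration with a made-up tick length:

```go
package main

import (
	"fmt"
	"time"
)

// queueRateMonitor estimates how fast the priority queue is filling up,
// which is the kind of signal a derivative term reacts to.
type queueRateMonitor struct {
	inflow  int // requests enqueued since the last tick
	outflow int // requests dequeued since the last tick
	tick    time.Duration
}

// fillRate returns the net queue growth per second over the last tick
// and resets the counters for the next tick.
func (m *queueRateMonitor) fillRate() float64 {
	rate := float64(m.inflow-m.outflow) / m.tick.Seconds()
	m.inflow, m.outflow = 0, 0
	return rate
}

func main() {
	m := &queueRateMonitor{inflow: 500, outflow: 350, tick: 500 * time.Millisecond}
	// 150 extra requests accumulated in half a second means the queue is
	// growing by roughly 300 requests per second.
	fmt.Printf("queue fill rate: %.0f req/s\n", m.fillRate())
}
```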
In the context of determining overload, the integral component might be associated with tracking how long the request queue has been in a non-empty state. This aligns with the idea of accumulating the integral of the error signal over time.
“Integral — based on how long the request has been in the queue…”
The derivative component, on the other hand, is related to the rate of change. It responds to how quickly the state of the request queue is changing.
“Derivative — rejection based on how fast the queue is filling up…”
The Integral component emphasizes the duration of the non-empty state, while the Derivative component considers the rate at which the queue is changing.
In the end, they use these three measures to determine the course of action for each request.
The question I have is how they combine these three components, if at all. I’m curious to understand how they monitor them as well.
Nevertheless, I think I’ve got the idea…
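For what it’s worth, the textbook way to combine the three terms is a weighted sum. I have no idea which gains Uber uses or whether Cinnamon combines them exactly like this, so treat this as the generic formula rather than their implementation:

```go
package main

import "fmt"

// pidOutput is the classic PID combination: a weighted sum of the current
// error (P), its accumulation over time (I), and its rate of change (D).
// The gains kp, ki, and kd are illustrative; I don't know Cinnamon's values
// or whether its auto-tuner adjusts them.
func pidOutput(kp, ki, kd, err, errIntegral, errDerivative float64) float64 {
	return kp*err + ki*errIntegral + kd*errDerivative
}

func main() {
	// Example: queue 30 requests above target, some accumulated backlog,
	// and the queue currently growing.
	fmt.Printf("shed signal: %.2f\n", pidOutput(0.01, 0.001, 0.005, 30, 200, 40))
}
```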
“The endpoint in the edge is annotated with the priority of the request and this is propagated from the edge to all downstream dependencies via Jaeger. By propagating this information, all services in the request chain will know the importance of the request and how critical it is for our users.”
The first thought that comes to mind is that it would seamlessly integrate into a service mesh architecture.
I appreciate the concept of employing distributed service tracing and headers to propagate request priority. Along these lines, why opt for a shared library, with this dependency added to each microservice, instead of placing it outside the service, perhaps as an Istio plugin? That would bring the usual benefits: independent release/deployment cycles, polyglot support, etc.
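To make the propagation idea concrete, here is a rough sketch of forwarding a priority value from an incoming request to a downstream call. The header name, tiers, and URL are my own placeholders; Uber actually carries this via Jaeger rather than a hand-rolled header like this.

```go
package main

import (
	"log"
	"net/http"
)

// priorityHeader is a hypothetical header name; Uber propagates the priority
// via Jaeger, not necessarily a plain HTTP header.
const priorityHeader = "X-Request-Priority"

// forwardWithPriority copies the caller's priority onto a downstream request,
// so every service in the chain can feed it to its own load shedder.
func forwardWithPriority(in *http.Request, downstreamURL string) (*http.Response, error) {
	out, err := http.NewRequestWithContext(in.Context(), http.MethodGet, downstreamURL, nil)
	if err != nil {
		return nil, err
	}
	if p := in.Header.Get(priorityHeader); p != "" {
		out.Header.Set(priorityHeader, p) // propagate unchanged
	}
	return http.DefaultClient.Do(out)
}

func handler(w http.ResponseWriter, r *http.Request) {
	resp, err := forwardWithPriority(r, "http://downstream.internal/work") // placeholder URL
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```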
Here are some additional thoughts:
Well, I’m biased, as I’m not a big fan of shared libraries, if only because I think they complicate the release/deployment process. However, I’m not sure whether there is a service-specific configuration aspect to consider. Perhaps they configure how long a service can wait before starting to process a request, and how long it has to complete it?
Perhaps one aspect worth exploring is the decision-making process of the rejector.
From what I understand, it determines whether to reject a request based on the PID controller, which is localized to the service. Is there an option for a more global approach? For instance, if it’s known that one of the downstream services in the pipeline is overloaded (due to its own PID controller), could any upstream service decide to reject the request before it reaches this overloaded service (which could be n steps further down the path)?
This decision could be based on the value returned by the PID controller or the auto-tuner of the downstream service.
Now, I’m musing on various aspects mentioned as they wrap up the article and provide some numbers to showcase the efficiency of their system, which is quite impressive.
They mention at some point that ‘Each request has a timeout of 1 second.’
“We run tests of 5 minutes, where we send a fixed amount of RPS (e.g., 1,000), where 50% of the traffic is tier 1 and 50% is tier 5. Each request has a timeout of 1 second.”
It is common in distributed systems to associate a request with a specific expiration time or deadline, with each service along the processing path responsible for enforcing this time limit. If the expiration time is reached before the request is completed, any service in the chain has the option to abort or reject the request.
I assume this 1-second timeout is attached to the request, and that each service, depending on how much of the deadline is left, can decide to abort the request. This measure is global because it is carried across all the services in the chain. I think it resonates with the point I made earlier about having a global view of the full system’s health or dependencies, so that a request can be aborted as early as possible if it has no chance of completing because of a service further down the path.
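In Go terms, that per-request deadline is exactly what a context deadline gives you: each hop can check how much of the budget is left and bail out early. A minimal sketch of that idea (mine, not from the article; the 50 ms safety margin and the simulated work are made up):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// handle checks the remaining budget before doing any work; if the deadline
// attached upstream has (almost) expired, it rejects immediately instead of
// wasting capacity on a request that cannot complete in time.
func handle(ctx context.Context) error {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < 50*time.Millisecond {
		return errors.New("rejected: not enough budget left to complete the request")
	}
	select {
	case <-time.After(100 * time.Millisecond): // simulated work
		return nil
	case <-ctx.Done():
		return ctx.Err() // deadline hit mid-processing
	}
}

func main() {
	// The edge attaches a 1-second timeout to the request, as in Uber's tests.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(handle(ctx))
}
```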
Could the ‘health’ of the downstream services (comprising data from their local PID controllers) be returned as headers attached to the responses and used to build a more evolved circuit breaker/early preemptive shedding mechanism?
Finally, I’m curious to learn more about the previous approach because, based on the description given in this article, it seems sound.
When you examine the goodput and latency measurements, there is no doubt about which of QALM or Cinnamon performs best. Note that they link to the QALM approach in the article. I should probably start from there ;)
As always, these approaches are not for everybody; Uber’s architecture and load are in a league of their own. I’m eager to read the next articles in this series, specifically to learn more about the PID controller and the auto-tuner.
With Cinnamon we have built an efficient load shedder that uses century-old techniques to dynamically set thresholds for rejecting and estimating capacity of the services. It solves the issues we noticed with QALM (and thus any CoDel-based load shedder), namely, that Cinnamon is able to:
- Quickly find a stable rejection rate
- Automatically adjust the capacity of the service
- Be used without setting any configuration parameters
- Incur very low overhead
What is interesting about this approach is that they consider all the requests waiting to be processed when deciding what to do with each new incoming request, since they use a (priority) queue. As mentioned, I’m curious whether the mechanism could also take into account the health of all the dependent services, based on the same PID measures…
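As a reminder of what such a priority queue looks like in Go, here is a tiny container/heap sketch ordered by tier (lower tier = more important). This is just the standard library pattern, nothing Cinnamon-specific.

```go
package main

import (
	"container/heap"
	"fmt"
)

// request is a queued request annotated with its priority tier (1 = most important).
type request struct {
	id   string
	tier int
}

// requestQueue implements heap.Interface so the most important request is served first.
type requestQueue []request

func (q requestQueue) Len() int            { return len(q) }
func (q requestQueue) Less(i, j int) bool  { return q[i].tier < q[j].tier }
func (q requestQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *requestQueue) Push(x interface{}) { *q = append(*q, x.(request)) }
func (q *requestQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &requestQueue{}
	heap.Init(q)
	heap.Push(q, request{id: "batch-job", tier: 5})
	heap.Push(q, request{id: "user-facing", tier: 1})
	// Tier 1 comes out first, so it is the last to be shed under load.
	fmt.Println(heap.Pop(q).(request).id) // user-facing
}
```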
There are other interesting aspects in this article, such as how they measure the effect of their strategies and how they compare against the previous approach. However, those don’t require more detailed notes than what I’ve already presented, so I highly encourage you to read the original article.
Found this article useful? Follow me on LinkedIn, Hackernoon, and Medium! Please 👏 this article to share it!