We are all aware of the age-old adage that failure is the best form of learning. This applies equally to the design of highly resilient, large-scale distributed apps. The most resilient apps are built after years of learning from system failures that have led to design changes, fine-tuning of the system architecture, and optimization of processes. Still, it is crucial for developers to remain proactive and not wait for failures to occur; they should always be on the lookout for systemic problems that could one day cause an end-user-impacting outage. During the initial design of the system, certain best practices and design patterns should be considered based on expected traffic numbers and usage patterns. This article touches upon two common design patterns for building resilient systems and offers some tips for implementing them in practice.
Distributed apps generally fail due to hardware errors (e.g., network connectivity issues, server failures, power outages) or application errors (e.g., functional bugs, poor exception handling, memory leaks). For the purposes of this article, we group these errors into two categories that offer a more useful view:
Transient errors: These errors are expected to go away in a short time, e.g., network issues.
Non-transient errors: These errors are not expected to go away in a short time, e.g., complete hardware failure.
Next, we talk about approaches to handle these errors.
Retries are best suited for transient errors: they do little to reduce the traffic being sent to the service experiencing issues, which is acceptable only when the problem is expected to clear quickly on its own. Typical retry strategies include retrying immediately, retrying after a fixed delay, and retrying with exponential backoff (usually with jitter), as sketched below.
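As a minimal illustration of the exponential backoff strategy, the Python sketch below retries a hypothetical call_service callable a few times with increasing, jittered delays; the function name and tuning parameters are assumptions, not part of any specific library.

```python
import random
import time

def call_with_retries(call_service, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry a flaky call with exponential backoff and jitter.

    `call_service` is a hypothetical zero-argument callable that raises an
    exception on transient failures (e.g., a network timeout).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call_service()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay,
            # with random jitter to avoid synchronized retry storms.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(delay * random.uniform(0.5, 1.5))
```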
Circuit breakers are best suited for non-transient errors. They offer a smarter way for an application to wait for a service to become healthy without wasting CPU resources. The biggest benefit of a circuit breaker is that it rate-limits traffic to a failing service, which increases the chances of the service becoming healthy sooner. A circuit breaker maintains a count of recent failures for a given operation within a time window and allows new requests to go through only if this count is below a threshold. Circuit breakers are generally modeled as a state machine with the following states:
Open: Do not allow any requests to go through and return a failure immediately. After a configured timeout, the circuit typically moves to the half-open state to probe whether the service has recovered.
Half-Open: Allow a limited number of requests to go through. If the number of failing requests crosses the threshold, the circuit moves back to the open state; if the requests succeed, the circuit moves to the closed state.
Closed: This is the healthy state and requests are allowed to go through. The circuit breaker monitors the count of failed requests over a time window to decide whether to trip the circuit to the open state.
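To make the state machine concrete, here is a minimal, single-threaded Python sketch of a circuit breaker; the class name, thresholds, and time windows are illustrative assumptions rather than a reference implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed/open."""

    def __init__(self, failure_threshold=5, window_seconds=30, open_seconds=15):
        self.failure_threshold = failure_threshold  # failures tolerated in the window
        self.window_seconds = window_seconds        # time window for counting failures
        self.open_seconds = open_seconds            # how long to stay open before probing
        self.state = "closed"
        self.failures = []                          # timestamps of recent failures
        self.opened_at = None

    def call(self, operation):
        now = time.monotonic()
        if self.state == "open":
            if now - self.opened_at < self.open_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"                # timeout elapsed, probe the service
        try:
            result = operation()
        except Exception:
            self._record_failure(now)
            raise
        if self.state == "half-open":
            self.state = "closed"                   # probe succeeded, close the circuit
            self.failures.clear()
        return result

    def _record_failure(self, now):
        # Keep only failures inside the sliding window, then add the new one.
        self.failures = [t for t in self.failures if now - t < self.window_seconds]
        self.failures.append(now)
        if self.state == "half-open" or len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```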
Circuit breakers can be configured in several ways, for example by tuning the failure-count threshold, the length of the monitoring time window, and how long the circuit stays open before probing the service again. The best way to select a configuration is to understand the reasons for service failure and pick the simplest option that handles them. Understanding the recovery patterns of the failing service plays an important role in configuring thresholds; for example, the circuit should not stay in the open state for long after the service has become healthy.
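As a purely illustrative example, a dependency that usually recovers within about a minute might be paired with a short open-state duration so the circuit probes it again soon after recovery; the numbers below are assumptions, not recommendations.

```python
# Hypothetical tuning for a dependency that typically recovers within ~60 seconds:
# trip after 10 failures in a 30-second window, then probe again every 20 seconds.
breaker = CircuitBreaker(failure_threshold=10, window_seconds=30, open_seconds=20)
```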
The major problem with circuit breakers and retries is that if the failing service does not become healthy quickly, there is no way out, which leads to poor user experience and potential loss of business. It is therefore important to have mechanisms that can still offer some functionality to the user in such cases. A fallback is one such mechanism. There are several types of fallback mechanisms:
“Kiddo” service: This funny-sounding option offers a set of very basic functionalities and can be switched on in case of a major outage of the live service. For example, if a taxi-booking service fails completely, a “Kiddo” service could kick in that allows users to book only one type of taxi. Ideally, the “Kiddo” service is hosted in a separate region or a separate data center.
Constants: We all hate constants in code, but having a set of local defaults for values that are normally retrieved from a remote service can save the day when that service has been down for several hours.
Cached responses: This option caches some previous successful responses from the remote service and uses them when needed.
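The snippet below sketches how the cached-response and constant fallbacks might be layered behind a remote call: try the live service, fall back to a recent cached value, and finally fall back to a local default. The function, cache, and default names are hypothetical, loosely following the taxi example above.

```python
import time

# Hypothetical local default and cache for a value normally served remotely.
DEFAULT_SURGE_MULTIPLIER = 1.0
_cache = {}  # city -> (value, timestamp)

def get_surge_multiplier(city, fetch_remote, cache_ttl_seconds=3600):
    """Return the surge multiplier, falling back to cache and then a constant."""
    try:
        value = fetch_remote(city)                  # primary: live remote service
        _cache[city] = (value, time.monotonic())    # refresh the cached response
        return value
    except Exception:
        cached = _cache.get(city)
        if cached and time.monotonic() - cached[1] < cache_ttl_seconds:
            return cached[0]                        # fallback 1: cached response
        return DEFAULT_SURGE_MULTIPLIER             # fallback 2: local constant
```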
Chaos testing is a good way to validate retry and circuit breaker configurations. These configurations evolve over time based on learnings from failures, and chaos testing makes it possible to mimic those failures in a controlled way and design reliable retry and circuit breaker configurations.
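As a rough illustration, and assuming the hypothetical call_with_retries and CircuitBreaker sketches from earlier, a fake dependency that fails at a configurable rate can be used to observe how a given configuration behaves under injected failures.

```python
import random

def flaky_service(failure_rate=0.5):
    """Fake dependency that fails randomly, mimicking an injected outage."""
    if random.random() < failure_rate:
        raise RuntimeError("injected failure")
    return "ok"

# Drive the breaker with injected failures and watch how quickly it opens.
breaker = CircuitBreaker(failure_threshold=5, window_seconds=30, open_seconds=15)
for i in range(50):
    try:
        breaker.call(lambda: call_with_retries(flaky_service, max_attempts=2))
    except Exception:
        pass  # failures are expected during the experiment
    print(i, breaker.state)
```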
We have talked about both circuit breakers and retries in detail, but there is still a lot to learn about them. I recommend choosing one, the other, or a combination of both depending on the failure patterns, application SLOs, traffic numbers, and usage patterns.