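Distributed computing is a complex field, and understanding the fallacies associated with it is crucial for building robust and reliable distributed systems. The eight fallacies of distributed computing are the assumptions that the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn’t change, there is one administrator, transport cost is zero, and the network is homogeneous.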
Understanding these fallacies is critical because they underscore the challenges of distributed computing. Failing to account for them can lead to system failures, security breaches, and increased operational costs. Building reliable, efficient, and secure distributed systems requires software design, architecture, and IT operations strategies that address them.
In this blog post, we will look at the first fallacy, its impact on microservices architecture, and how to work around it. Let’s say we’re using Spring Boot to write our microservice, backed by MongoDB deployed as a StatefulSet in Kubernetes, and we’re running all of this on EKS.
You might argue that it is the cloud provider’s job to give us a reliable network, and that we’re paying them for high availability. The expectation isn’t wrong, but unfortunately things don’t always work as expected when you rent hardware in the cloud. Let’s say your cloud provider promises 99.99% availability, which is impressive, right? No, it ain’t, and I’ll explain why. At 99.99% availability, roughly 1 in every 10,000 requests can still fail, and the yearly downtime budget still comes to about 52 minutes.
Now, you might say, “My system doesn’t get that kind of traffic!” Fair enough, but this is the availability figure for the cloud provider as a whole, not for your service instances. If the provider handles a billion requests within its network, 100,000 of them will fail! To make matters worse, you can’t expect those failures to be spread evenly across all accounts using the hardware; you might get hit by any number of them, depending on your luck. The question is, do you want to run a business on the chance that these outages won’t hit you? I hope not! So here’s the first (and most critical) fallacy of distributed computing: the network is reliable.
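To put numbers on the 99.99% figure, here is a small arithmetic sketch. The class and method names are made up for illustration, and it assumes availability can be read as a simple success ratio, which is a simplification of how providers actually define their SLAs.

```java
// Illustrative arithmetic only: what a given availability figure still leaves room for.
// AvailabilityBudget and its methods are hypothetical names, not part of any library.
final class AvailabilityBudget {
    private AvailabilityBudget() {}

    /** Expected number of failed requests, treating availability (e.g. 0.9999) as a success ratio. */
    static long expectedFailures(long totalRequests, double availability) {
        return Math.round(totalRequests * (1.0 - availability));
    }

    /** Downtime allowed per year, in minutes, assuming a 365-day year. */
    static double downtimeMinutesPerYear(double availability) {
        double minutesPerYear = 365 * 24 * 60; // 525,600 minutes
        return minutesPerYear * (1.0 - availability);
    }
}
```

Calling `AvailabilityBudget.expectedFailures(1_000_000_000L, 0.9999)` gives the 100,000 failed requests mentioned above, and `downtimeMinutesPerYear(0.9999)` comes to roughly 52.56 minutes.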
Let’s take an e-commerce system as an example. The product catalog is served by the Product microservice, but SKU availability might be fetched from another microservice while building the product catalog response. One could argue that SKU information can be replicated into the product catalog via choreography, but for the scope of this example, let’s assume that’s not in place. The Product service, therefore, makes a REST API call to the SKU service. What happens when this call fails? How would you tell the end user whether the product they are looking at is available?
Scary stuff, yeah? Well, not that scary. As engineers, we love to test ourselves on harder frontiers, and we have a few tricks up our sleeves.
This topic is worthy of a book in itself rather than a blog post, but I’ll try to cover what I can while keeping it simple. Most of what I’m sharing here comes from experience gathered while transitioning from a monolith to microservices for the SaaS business at NimbleWork. I hope others find it helpful too.
The following patterns help get around transitory outages, or blips, as we usually call them. The underlying assumption is that such outages last a second or two at worst.
One of the simplest things to do is to wrap your network calls in retry logic, so that the calling service tries several times before it finally gives up. The idea is that a temporary network snag at the cloud provider won’t outlast the retries.
Microservice libraries and frameworks in almost all common programming languages provide this feature. The retries themselves have to be discriminating: retrying after a 400, for example, will not change the outcome until the request itself is changed. Here’s an example of using retries when making REST API calls with Spring WebFlux WebClient.
Here’s a summary of what we’re trying to achieve with this piece of code: retries are spaced out with jitter, and they are attempted only when the upstream responds with a 504, 503, or 502 status.
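A minimal, framework-free sketch of the retry behaviour summarised above, written in plain Java rather than with WebClient. The class `RetryingCaller`, the `UpstreamException` type, and the backoff and jitter values are assumptions made for illustration; they are not taken from the original listing or from Reactor’s API.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical helper: retries a call only when the failure looks transient.
final class RetryingCaller {
    /** Statuses that usually signal a transient upstream problem. */
    static final Set<Integer> RETRYABLE = Set.of(502, 503, 504);

    /** Hypothetical exception carrying the HTTP status of a failed call. */
    static final class UpstreamException extends RuntimeException {
        final int status;

        UpstreamException(int status) {
            super("upstream responded with " + status);
            this.status = status;
        }
    }

    /** Runs the call, retrying up to maxRetries times on 502, 503, or 504. */
    static <T> T callWithRetry(Supplier<T> call, int maxRetries, Duration baseDelay) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (UpstreamException e) {
                // A 400 fails the same way every time, so non-retryable statuses are rethrown at once.
                if (!RETRYABLE.contains(e.status) || attempt >= maxRetries) {
                    throw e;
                }
                sleep(delayWithJitter(baseDelay, attempt));
            }
        }
    }

    /** Exponential backoff (base * 2^attempt) scaled by a random factor in [0.5, 1.5). */
    static long delayWithJitter(Duration baseDelay, int attempt) {
        long exponential = baseDelay.toMillis() << Math.min(attempt, 10);
        double jitter = 0.5 + ThreadLocalRandom.current().nextDouble();
        return (long) (exponential * jitter);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

The jitter keeps many clients that failed at the same moment from retrying in lockstep, which would otherwise hit a recovering upstream with synchronized bursts.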
These retries can help recover from blips or snags that aren’t expected to last long. This can also be a good mechanism if the upstream service we’re calling restarts for whatever reason.
Note: We’ve noticed that running replica sets in Kubernetes with the rolling update strategy helps reduce such blips and, hence, retries.
While this example uses the Reactor Project’s implementation in Spring, all major frameworks and languages provide alternatives, for example scala.util.{Failure, Try} if you’re using Scala without any framework as such.
I’m sure this is not an exhaustive list. This pattern takes care of transitory network blips, but what if there’s a sustained outage? More on that later in this article.
What if the called service keeps crashing and the retries from all its clients are exhausted? I prefer falling back to a last-known-good version. A couple of strategies, on both the infrastructure side and the client side, can enable this last-known-good-version policy, and we’ll briefly touch upon each of them.
Caching at downstream
The browser, or any client for that matter, continuously writes data to an in-memory store until it receives a heartbeat from the upstream. This mechanism offers various implementations for both UI and headless clients that make service calls through gRPC or REST. Here is a summary of what to do, regardless of the type of client.
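A rough sketch of the last-known-good idea for a read path, assuming the client keeps the most recent successful response per key in memory and serves it when the upstream call fails. `LastKnownGood` is a hypothetical class, and a real client would also need expiry and some way to tell the user that the data may be stale.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical wrapper that remembers the last successful response per key.
final class LastKnownGood<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> upstream;

    LastKnownGood(Function<K, V> upstream) {
        this.upstream = upstream;
    }

    /** Returns a fresh value when the upstream answers, else the last good one if any. */
    Optional<V> fetch(K key) {
        try {
            V fresh = upstream.apply(key);
            if (fresh != null) {
                lastGood.put(key, fresh); // remember it for the next outage
            }
            return Optional.ofNullable(fresh);
        } catch (RuntimeException e) {
            // Upstream failed: fall back to whatever was seen last for this key.
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}
```

In the e-commerce example, the Product service could wrap its SKU lookup this way, so that a failed call still yields the last availability it saw instead of an error.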
We’d be lucky if distributed systems only ever suffered blips, but they don’t. Hence, we have to build our systems to handle long-lived outages. Here are some of the patterns that help in this regard.
Bulkheads address the contingency of outages caused by slow upstream services. While the ideal solution is to fix the upstream issue, it’s not always feasible. Consider a scenario where the service (X) you’re calling relies on another service (Y) that exhibits sluggish response times. If service (X) experiences a high volume of incoming traffic, a significant portion of its threads may be left waiting for the slower upstream service (Y) to respond. This waiting not only slows down the service (X) but also increases the rate of dropped requests, leading to more client retries and exacerbating the bottleneck.
To mitigate this issue, one effective approach is to localize the impact of failures. For instance, you can create a dedicated thread pool with a limited number of threads for calling the slower service. By doing so, you confine the effects of slowness and timeouts to a specific API call, thereby enhancing the overall service throughput.
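The dedicated thread pool described above might look something like the following sketch. `Bulkhead` is a hypothetical class built on the JDK’s `ThreadPoolExecutor`; the pool and queue sizes are assumptions to be tuned per service, and the queue capacity must be at least 1.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical bulkhead: calls to one slow dependency get their own small, bounded pool.
final class Bulkhead {
    private final ExecutorService pool;

    Bulkhead(int threads, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),  // bounded queue, so waiting work cannot pile up
                new ThreadPoolExecutor.AbortPolicy());    // reject instead of blocking the caller
    }

    /** Submits the slow call; throws RejectedExecutionException when pool and queue are both full. */
    <T> Future<T> submit(Callable<T> slowCall) {
        return pool.submit(slowCall);
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Because rejected calls fail fast, threads serving the rest of service (X) are not left waiting on service (Y); the caller can catch the rejection and return a fallback instead.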
Circuit Breakers could easily be avoided if only we wrote services that never go down! However, the reality is that our applications often rely on external services developed by others. In these situations, Circuit Breaker, as a pattern, becomes invaluable. It routes all traffic between services through a proxy, which promptly starts rejecting requests once a defined threshold of failures is reached. This pattern proves particularly useful during prolonged network outages in external services, which could otherwise lead to outages in the calling services. Nevertheless, ensuring a seamless user experience in such scenarios is vital, and we’ve found two approaches to be effective:
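The trip-and-reject behaviour of the proxy described above can be sketched in a few lines. This is a deliberately simplified, hypothetical `CircuitBreaker` class, not the API of any particular library: it counts consecutive failures, stays open for a fixed period, and then lets one trial call through. The threshold, the open period, and the injected clock are all assumptions for illustration.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker. The whole call is synchronized for brevity,
// which serializes upstream calls; a production breaker would avoid holding a lock there.
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // millisecond clock, e.g. System::currentTimeMillis
    private int consecutiveFailures;
    private boolean open;
    private long openedAt;

    CircuitBreaker(int failureThreshold, Duration openFor, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openFor.toMillis();
        this.clock = clock;
    }

    /** Calls the upstream unless the breaker is open; uses the fallback on rejection or failure. */
    synchronized <T> T call(Supplier<T> upstream, Supplier<T> fallback) {
        if (open && clock.getAsLong() - openedAt < openMillis) {
            return fallback.get(); // open: reject without touching the upstream
        }
        try {
            T result = upstream.get(); // closed, or a trial call after the open period
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true; // trip, or re-trip after a failed trial call
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

Passing the clock in as a parameter is only there to make the open period easy to exercise in tests; the fallback supplier is where a last-known-good response would typically be plugged in.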
Despite a cloud provider’s commitment to high availability, network failures remain inevitable because of the vast scale and unpredictable nature of these networks, and that underscores the critical need for resilient systems. This journey immerses us in the realm of distributed computing, challenging us as engineers to arm ourselves with strategies for fault tolerance and resilience. Employing techniques like retries, last-known-good version policies, and the development of separate client-server architectures with state management on both ends equips us to confront the unpredictability of network outages.
As we navigate the intricacies of distributed systems, the adoption of these strategies becomes imperative to ensure smooth user experiences and system stability. Welcome to the world of Microservices in the Cloud, where challenges inspire innovation, and resilience forms the bedrock of our response to unreliable networks.