When you start exploring the world of distributed systems, you begin to see the name “CAP theorem” buzzing here and there. Whether you’re familiar with the overall concept, need to refresh your knowledge the night before your system design interview, or are completely new to it, this article aims to help you. Before delving into the theorem itself, let’s first review some basic concepts to provide context.
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another (Wikipedia).
In general, a distributed system refers to a collection of nodes that communicate with each other over a network, using various protocols and approaches like RPC calls or messaging systems, to achieve a common objective.
Consider, for instance, a food delivery application that lets you order meals from various restaurants. Why is such a system necessary? Why can't it simply rely on a single server? There are several factors to consider, but two key concepts are scalability and reliability.
Scalability is a term that describes a system’s ability to cope with increased load. (source)
It means that, on one hand, a system can handle increased load by adding more resources, and on the other hand, it can release unused resources to cut costs when they are no longer required.
Here I should mention that there are two types of scaling - vertical and horizontal.
Vertical scaling refers to adding or removing resources (CPU, memory, storage) on a single machine, whereas horizontal scaling refers to adding more nodes that provide the same functionality as the load-targeted node.
In distributed systems, we lean more toward horizontal scaling. To illustrate a use case for this principle, we can refer to the above-mentioned food delivery service. Note that load is not distributed evenly throughout the day: you probably have the fewest orders at 4 AM, while the peak will most likely be around lunch or dinner.
Let’s say you only need X amount of resources during the night and 20X during the peak hour. You definitely don’t want to pay for 20X resources during the night, and conversely, you won’t be able to process peak-hour requests with only the X resources that are sufficient at night. To solve this problem, you should scale your resources in accordance with the load, which, with horizontal scaling, means adding or removing nodes of the service.
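To make this concrete, here is a minimal sketch of the decision an autoscaler could make. The capacity figure, the thresholds, and the desired_node_count function are illustrative assumptions for this article, not any particular product’s API:

```python
import math

# Assumed figure for illustration: one node comfortably handles
# this many requests per second.
CAPACITY_PER_NODE = 1_000

MIN_NODES = 1   # never scale to zero: keep the service reachable at 4 AM
MAX_NODES = 20  # cap the fleet at roughly the "20X" peak mentioned above

def desired_node_count(current_load_rps: float) -> int:
    """Pick how many identical nodes we need for the current load.

    This is the essence of horizontal scaling: instead of buying a
    bigger machine (vertical scaling), we change the *number* of
    machines to follow the load curve throughout the day.
    """
    needed = math.ceil(current_load_rps / CAPACITY_PER_NODE)
    return max(MIN_NODES, min(MAX_NODES, needed))

# Night-time trough vs. lunch-time peak from the example above:
print(desired_node_count(800))     # -> 1  (the "X" resources)
print(desired_node_count(19_500))  # -> 20 (the "20X" resources)
```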
Reliability is defined as the probability that a component of a system will perform a required function for a given period of time when used under operating conditions (source).
In simple words, it’s the system’s ability to continue operating even when facing different sorts of failures. As an example, imagine that you are hosted in a single data center, which for some physical reason has connectivity problems. In that case, you won’t be able to operate at all. To avoid such a problem, you would probably want to be hosted in multiple data centers. There might be different types of failures and ways to handle them, but the general goal is to keep your system operating as expected for the end user, even during times of failure.
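As a rough sketch, multi-data-center failover can be as simple as trying replicas in order until one answers. The data center names and the fetch_from stub below are hypothetical stand-ins for a real network layer:

```python
# Hypothetical data centers hosting the same service; in reality these
# would be real endpoints behind a load balancer or a DNS entry.
DATA_CENTERS = ["dc-east", "dc-west", "dc-central"]

class DataCenterDown(Exception):
    pass

def fetch_from(dc: str, request: str) -> str:
    # Stand-in for a real network call; here we pretend "dc-east"
    # is the data center with connectivity problems.
    if dc == "dc-east":
        raise DataCenterDown(dc)
    return f"{request} served by {dc}"

def reliable_fetch(request: str) -> str:
    """Keep serving the end user despite a failing data center."""
    last_error = None
    for dc in DATA_CENTERS:
        try:
            return fetch_from(dc, request)
        except DataCenterDown as err:
            last_error = err  # try the next data center
    raise RuntimeError(f"all data centers failed: {last_error}")

print(reliable_fetch("GET /orders/42"))  # -> "GET /orders/42 served by dc-west"
```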
Understanding these concepts, use cases, and examples allows us to realize that the vast majority of today’s services have to exist as distributed systems. We have covered the advantages of distributed systems, but what about the limitations? The CAP theorem defines those limitations.
The CAP theorem was formulated by computer scientist Eric Brewer in 1998, and he presented it as a conjecture at the Symposium on Principles of Distributed Computing (PODC) in 2000. It was later formally proven by Seth Gilbert and Nancy Lynch in 2002.
To understand the theorem and its formulation, we first need to look at the three components of the CAP acronym.
In a replicated system, consistency means that whenever a read operation is performed on any node, it always returns the most recent data. For example, if you have three nodes in your database - x, y, and z - and a write operation is executed on node x, there should be no time gap between when this write affects the read result from node x and when it affects the read results from nodes y and z.
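A toy model of this property, under the assumption that a write is acknowledged only after every replica has applied it (this is an illustration, not a real database client):

```python
class Node:
    """A single replica holding a copy of the data."""
    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def read(self, key: str):
        return self.data.get(key)

# Three replicas, as in the x/y/z example above.
x, y, z = Node("x"), Node("y"), Node("z")
cluster = [x, y, z]

def consistent_write(key: str, value) -> None:
    # The write is acknowledged only once *every* replica has applied
    # it, so there is no time gap during which readers could observe
    # different values on different nodes.
    for node in cluster:
        node.data[key] = value

consistent_write("order:42", "delivered")
assert x.read("order:42") == y.read("order:42") == z.read("order:42")
```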
Availability, on the other hand, guarantees that the system functions as expected and that any request will receive a non-error response. This doesn't necessarily mean that there won't be any logical errors or internal failures, but rather that there won't be any error responses caused by communication failures between nodes in the database.
Lastly, partition tolerance refers to the system’s ability to keep processing requests even when the network connection between nodes is lost and the cluster is split into parts that can’t communicate. Network partitions are bound to happen in the real world, especially as systems become more complex and distributed. In fact, the more complex your system becomes, the more frequently network failures are likely to occur.
Any distributed store can provide no more than two of the three properties: consistency, availability, and partition tolerance. As a result, any distributed store can be consistent and available (CA), consistent and partition tolerant (CP), or available and partition tolerant (AP).
Let’s approach our proof with a contradiction: assume that a distributed store can be consistent, available, and partition tolerant at the same time. We have storage with a replica set of three nodes. The network connection between node 3 and the other nodes is broken (partitioned), and node 1 receives a write request. Since we assume that our system is partition tolerant, this shouldn’t be a problem: node 3 should still be able to operate and serve read operations.
However, we also assume that our system is both consistent and available. Node 3’s response can’t be consistent, because it hasn’t received the latest write handled by node 1; if it responds to the user with the data it has, that won’t be the latest data in the system, which means our system is not consistent.
On the other hand, if node 3 waits until the network partition is resolved, so it can receive the latest data from node 1 and become consistent, it fails the availability guarantee we assumed our system provides. This means that our system can’t be partition tolerant, consistent, and available at the same time.
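The dilemma can be captured in a few lines of toy code: once node 3 is cut off from node 1’s latest write, either answer it gives sacrifices one of the two remaining guarantees. Everything below (names, values, the choose_consistency flag) is illustrative:

```python
class Partitioned(Exception):
    """Raised when a node cannot reach the rest of the cluster."""

def read_during_partition(stale_value: str, choose_consistency: bool) -> str:
    """Node 3's options while it cannot see node 1's latest write.

    Insisting on consistency means refusing the read -> not available.
    Insisting on availability means answering with old data -> not consistent.
    There is no third option, which is the contradiction in the proof.
    """
    if choose_consistency:
        raise Partitioned("cannot confirm the latest data; refusing to answer")
    return stale_value  # answer immediately, but possibly out of date

# Node 1 accepted a new write that node 3 never received:
latest_on_node_1 = "balance = 50"
stale_on_node_3 = "balance = 100"

print(read_during_partition(stale_on_node_3, choose_consistency=False))
# -> "balance = 100": available, but inconsistent with node 1

try:
    read_during_partition(stale_on_node_3, choose_consistency=True)
except Partitioned as err:
    print(err)  # consistent behavior, but the read was not available
```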
Now we need to show that the system can indeed be CA, CP, or AP. For all three situations, we will be looking at the same setup.
For a CP system, assuming that it's both consistent and partition tolerant, we can still continue operating despite a network failure with node 3. However, in order to maintain consistency, node 3 won't be available for reads until its connection is restored and it has received the latest data.
In contrast, for an AP system that's both available and partition tolerant, we can continue to operate even during a network partition. Although node 3 may not have the latest data, it can still respond with the data it has, albeit inconsistently. This ensures that the read calls are still available.
For a CA system, we don’t need to worry about network partitions, because we don’t guarantee a working cluster in such a situation. Therefore, we only need to ensure that node 3 can respond with the latest data without delay. To accomplish this, node 1, which accepts the write, will ensure that the latest write is propagated to the other cluster nodes before responding to the writer. This is straightforward, since we don’t have any network partitions to contend with.
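One way to see the three scenarios side by side is a read handler whose behavior depends on the chosen pair of guarantees. The Mode enum and the partitioned flag are assumptions made for this sketch, not part of any real protocol:

```python
from enum import Enum

class Mode(Enum):
    CP = "consistent + partition tolerant"
    AP = "available + partition tolerant"
    CA = "consistent + available (partitions assumed away)"

class Unavailable(Exception):
    pass

def node3_read(mode: Mode, partitioned: bool, local_value: str) -> str:
    """How node 3 answers a read in each of the three scenarios above."""
    if mode is Mode.CP and partitioned:
        # Stay consistent: refuse reads until we catch up with node 1.
        raise Unavailable("waiting for the partition to heal")
    if mode is Mode.CA and partitioned:
        # CA makes no promise here: the model assumes this never happens.
        raise RuntimeError("a CA system does not define behavior under partition")
    # AP during a partition (possibly stale data), or any mode without one.
    return local_value

print(node3_read(Mode.AP, partitioned=True, local_value="old data"))  # stale, but served
try:
    node3_read(Mode.CP, partitioned=True, local_value="old data")
except Unavailable as err:
    print(err)  # consistent, but this read was not available
```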
As you have probably noticed, the CAP theorem is a fundamental concept that shows the tradeoffs you have to make when working with distributed systems. In practice, however, you have to choose between consistency and availability, because network partitions are something you have to be prepared to handle. To make a proper decision, you need a deep understanding of the problem you want to solve, precise requirements, and a vision for the foreseeable future; otherwise, the consequences might ruin all your work.
For example, if you are building a system related to finance, it is crucial to keep data as consistent as possible; otherwise, discrepancies and the decisions based on them may result in tremendous financial loss. On the other hand, if you are designing a system that serves a user profile page on some social media platform, it’s not so crucial to always show the latest, consistent update, and favoring availability is the preferred way to design the system.
As you might have noticed, the CAP theorem was introduced back in 2000, which is a relatively long time ago, and it has drawn a fair amount of criticism since then. Indeed, approaches to solving problems in distributed systems have progressed, and new concepts have appeared that allow a deeper look at these tradeoffs. To learn more, I highly recommend exploring the PACELC theorem.
Overall, the CAP theorem provides a relevant view and a way to think about the design of distributed systems; it forces you to consider the tradeoffs involved and is a great way to start exploring the world of distributed systems.