When learning about blockchain consensus algorithms and distributed systems in general, you will inevitably come across terms like FLP impossibility and Byzantine fault tolerance. While there is plenty of literature on these subjects, it often suffers from a narrow focus, failing to explain the connections and relationships between them. Furthermore, much of the existing literature gives either too much or too little technical detail — I found this to be especially true when learning about consensus algorithms like proof of stake.
This miniseries is an attempt to strike that balance of detail for beginner to intermediate readers who know the What and are trying to understand the Why and How. We will try to “connect the dots” by explaining the theoretical background of consensus agreement in distributed systems (part 1) and how it applies and relates to different algorithms used in the blockchain (part 2).
We assume the readers know what a consensus algorithm is and why it is needed in a distributed system. If a consensus algorithm is to be reliable, it must satisfy the following properties:
- safety (also called agreement) — nodes will agree on the same value, which was proposed by one of the nodes.
- liveness (also called termination) — the algorithm is guaranteed to make progress towards the outcome and will not stall under any conditions. In other words, the nodes will eventually reach an agreement.
- fault tolerance — the algorithm guarantees that even in case of a node failure, all of the healthy nodes will still reach consensus.
To design an algorithm that is fault tolerant, we must first understand what exactly a node failure is and how it manifests in terms of the node’s behavior. Here is a non-exhaustive list of some of the main types of failures in distributed systems:
- crash failures (sometimes described as fail-stop failures; there is a minor distinction which we’ll leave out) — the node comes to a halt and does not resume. No messages are sent or received and no progress in execution is made. This is the simplest type of fault and the easiest to deal with, as the behavior of the node is well defined and very limited (on/off).
- omission failures — the node omits to send or receive a message. Its state is preserved and the node can subsequently resume correct behavior. If every message gets omitted, we get a crash failure; conversely, if a crashed node is restarted (and recovers its state), we get an omission failure. Thus we say that omission failures subsume crash failures: a consensus algorithm that tolerates omission failures is by default tolerant to crash failures, and an algorithm that is not crash-tolerant is not omission-tolerant either. Be aware, however, that a scenario where a crashed node does not recover its state upon restart (a so-called amnesia crash) does not fall into the omission category, as the state is not preserved!
- Byzantine failures — the node responds with arbitrary messages, or does not respond at all. This can be caused by faults in the system (hardware issues, software bugs, etc.) or by malicious intent. Byzantine failures are the hardest to deal with because the behavior is non-deterministic and the node might appear healthy to the rest of the network. This category subsumes the previous two (some or all messages might be omitted).
Lastly and equally imperative, we must have a clear understanding of the underlying infrastructure and the constraints it puts on the system (e.g. communication and message processing delays, clock synchronization, etc.). There are several models of distributed systems that pose different assumptions on the behavior of the infrastructure; to keep things simple, we will introduce only two fundamental ones:
- In a synchronous model, there is a known upper bound on the delay for transmitting and processing a message (in other words, the transmission and processing are guaranteed to finish within a known time Δ). This is an important assumption that allows us to detect crash failures (albeit not Byzantine failures): if the time taken to send and process a message exceeds the known bound, the node can be considered faulty.
- In an asynchronous model, there is no upper bound on the delay for transmitting or processing a message; while the transmission and processing are guaranteed to eventually finish (i.e. they are finite), they can take arbitrarily long. This complicates things, as we are not able to detect crash failures: if a node is not responding, we have no way of knowing whether it crashed or whether it is healthy and the processing/transmission is simply taking that long. Notice that the asynchronous model subsumes the synchronous one: consensus algorithms that work in the asynchronous model work in the synchronous model too, and algorithms that do not work in the synchronous model do not work in the asynchronous one either.
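The practical difference between the two models can be sketched in code. Below is a toy failure detector; the bound `DELTA`, the `probe` helper and the `silent_node` stand-in are all illustrative names, not part of any real library:

```python
import time

DELTA = 0.2  # assumed known upper bound on transmit + processing delay (illustrative)

def probe(send_request, deadline=DELTA):
    """Return 'ok' if the node replies within the bound, else 'suspected-crashed'.

    Under the synchronous model this verdict is reliable: a healthy node is
    guaranteed to reply within DELTA. Under the asynchronous model the same
    check can wrongly suspect a slow-but-healthy node, since no bound exists.
    """
    start = time.monotonic()
    reply = send_request(timeout=deadline)
    if reply is not None and time.monotonic() - start <= deadline:
        return "ok"
    return "suspected-crashed"

# A stand-in for a node that never answers -- crashed or merely slow, we cannot tell.
def silent_node(timeout):
    time.sleep(timeout)
    return None

print(probe(silent_node))             # suspected-crashed
print(probe(lambda timeout: "pong"))  # ok
```

The point of the sketch is the comment inside `probe`: the verdict is only trustworthy when the synchrony assumption actually holds.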
It becomes apparent that an asynchronous model with Byzantine failures presents the weakest set of assumptions of all the combinations we listed. In fact, it is the weakest set of assumptions, period — it subsumes any other model and failure type. Unfortunately, finding an algorithm for such conditions is also the most difficult, as we have to account for non-deterministic behavior and are not able to reliably detect even crash failures. Indeed, as we will shortly see, finding such an algorithm turns out to be impossible.
In 1985, Michael Fischer, Nancy Lynch and Michael Paterson (hence FLP) published a seminal paper, which became known as the FLP Impossibility theorem. The paper lays down a proof of the following statement:
In an asynchronous model with crash failures, it is impossible to find a consensus algorithm that would satisfy safety, liveness and fault tolerance.
In other words, such an algorithm does not exist. Key things to highlight here:
- the paper contains a formal proof of the above statement. This brings significant implications (some of which we will soon discover) and became a major milestone in the field of distributed systems.
- the paper explicitly specifies an asynchronous model with crash failures. However, from the previous section we know that since omission and Byzantine failures subsume crash failures, the statement applies to omission and Byzantine failures by default.
- it is enough for just one node to be potentially faulty for the statement to hold true. In other words, even if we assume that at most one node can fail, it is still impossible to find a consensus algorithm that would be tolerant to that possible failure while achieving safety and liveness. This is because to achieve fault tolerance, the algorithm has to be able to reliably detect the failure, which is simply impossible in an asynchronous model — regardless of how many nodes can be affected!
So what can we do? Although the situation might seem hopeless at first, it is not the end of the world. Here are our options:
- extend/change the model. Apart from the synchronous model, which we explained in the previous section, there are several other solutions (randomized algorithms, partially synchronous models, unreliable failure detectors), all of which extend the asynchronous model in some way (we won’t explore these further as that would require a series of its own).
- loosen the algorithm requirements. For example, instead of all healthy nodes having to agree on a value, it might be enough if the majority agrees (safety). Or we accept that the algorithm might not always terminate (liveness), etc.
Extending the model means we pose some additional assumptions (like the upper bound for delay in the synchronous model). As long as we are guaranteed these assumptions will be met, we have no issue — a reliable consensus algorithm exists. The problem is that in the real world, most systems are intrinsically asynchronous, meaning the assumptions we make using some extended model will not always hold true. Let’s take the synchronous model as an example: in most cases, the delay for message delivery and processing will fit within the model’s bound (the higher the bound the lower the chance the delay will exceed it). But in an asynchronous system, the probability of the delay exceeding the bound will always be > 0, no matter the size of the bound. Thus, we cannot guarantee the correctness of the outcome in every case — the algorithm is not 100% reliable.
So what happens if the delay exceeds the bound? First of all, if the reason is a node failure, the algorithm is still correct — remember safety and fault tolerance (all healthy nodes will reach an agreement). Only if a healthy node fails to respond within the set bound does the algorithm lose one of its properties — in this case safety, since liveness is guaranteed by using a timeout for failure detection.
Example: 2-Phase Commit
To better illustrate, we will use the 2-Phase Commit algorithm as an example. 2-Phase Commit is an atomic commitment protocol used in distributed databases, serving two functions:
- agreeing whether to commit or rollback a transaction (consensus) and
- executing the decided action (commit/rollback).
It is important to realize that reaching consensus is only one part of 2-Phase Commit. Even more important, however, is the fact that 2-Phase Commit is NOT a reliable protocol. As we will shortly demonstrate, it does not satisfy the 3 properties even in a synchronous model. With that said, here is the complete message flow (credit Wikipedia):
```
Coordinator                                 Participants

1. QUERY TO COMMIT
        -------------------------------->
2. VOTE YES/NO                              prepare*/abort*
        <--------------------------------
   commit*/abort*
3. COMMIT/ROLLBACK
        -------------------------------->
4. ACKNOWLEDGMENT                           commit*/abort*
        <--------------------------------
   end
```

(entries marked with * denote records written to the durable log)
1. One of the nodes (called the Coordinator) sends a Query to commit message to all the other nodes (called Participants).
2. Each participant responds with Yes or No.
3. If every participant responds with Yes, the coordinator sends a Commit message to all participants; if at least one No was received or some participant failed to respond, Rollback is sent instead. Once this step is finished, consensus is reached.
4. Participants execute the decided action and respond with ACK. This step is part of the execution phase and lies outside of the consensus agreement.
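The consensus half of the flow above can be sketched as a coordinator-side decision rule. This is a toy in-memory model under simplifying assumptions (each participant is a callable, a missing vote is modeled as `None`); it is a sketch, not a production protocol implementation:

```python
def two_phase_commit(participants):
    """Toy 2PC coordinator: phase 1 collects votes, phase 2 decides.

    Each participant is modeled as a zero-argument callable returning
    'YES', 'NO', or None (no response before the timeout).
    """
    votes = []
    for participant in participants:
        try:
            votes.append(participant())
        except Exception:
            votes.append(None)  # a crashed participant counts as a missing vote
    # A single No -- or a single missing vote -- forces a rollback (step 3).
    return "COMMIT" if votes and all(v == "YES" for v in votes) else "ROLLBACK"

print(two_phase_commit([lambda: "YES", lambda: "YES"]))  # COMMIT
print(two_phase_commit([lambda: "YES", lambda: None]))   # ROLLBACK -- missing vote
print(two_phase_commit([lambda: "YES", lambda: "NO"]))   # ROLLBACK
```

Note how the rule is deliberately conservative: any doubt about a vote resolves to Rollback, which is the safe outcome for data integrity.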
Notice in step 3 that Rollback is sent if some participant fails to respond. This is detected by using a timeout. Let’s look at all possible failures and how different models and different implementation decisions affect the outcome of the algorithm.
1. Coordinator fails to send Query to commit
```
Coordinator                                 Participants

1. QUERY TO COMMIT
        ------>  X
```
- If the coordinator fails before sending any Query to commit, there is no issue with regard to consensus — the algorithm has not commenced yet (it starts with the first Query to commit). The participants will be blocked until the coordinator recovers. The problem here is purely practical: waiting for the coordinator could severely impact performance since the transaction could have locked some resources. To prevent that, we can use a timeout and rollback, however note that there is no upper bound for us to use as the timeout even in a synchronous model! This is because, in a generic database system, there is no upper limit on the length of the transaction — it is dependent on the input from the client and can be arbitrarily long. Thus, we’re not able to distinguish between a crashed coordinator and a long-running transaction. While using a timeout would be safe in terms of all nodes reaching consensus (once the coordinator recovers, participants that already rolled back simply respond with No to Query to commit), there would always be some risk of rolling back an ongoing transaction unnecessarily. It is nonetheless still worth considering as it should be fairly small (indeed, some databases give you the option to configure a transaction timeout).
- If the coordinator fails after it already sent Query to commit to some of the participants, the above point does not pertain to the participants that received the query. For these participants, see case number 3 below.
2. Participant fails to respond to Query to commit
```
Coordinator                                 Participants

1. QUERY TO COMMIT
        -------------------------------->
2. ?
    X   <-------
3. ROLLBACK
        -------------------------------->
```
- In a synchronous system, we can use the upper bound on the delay as a timeout to detect the failure and rollback the transaction. If the failed participant subsequently recovers and sends its vote (an omission failure), the coordinator will simply respond with the rollback.
- In an asynchronous system, there is a chance the participant did not fail and is simply taking too long to respond. Similarly to the previous cases, it is safe to use a timeout and roll back the transaction. If the participant that appeared faulty was in fact healthy, the transaction will be rolled back unnecessarily; however, consensus will be reached and data integrity protected.
3. Coordinator fails to send Commit/Rollback
```
Coordinator                                 Participants

1. QUERY TO COMMIT
        -------------------------------->
2. YES
        <--------------------------------
3. ?
        ------>  X
```
- Here we have a problem. Imagine the coordinator sends Commit to half of the participants, then crashes. If the remaining participants use a timeout and rollback (even in a synchronous model), we will get a group of participants that will commit and a group that will rollback, breaking safety and thus data integrity. Without the timeout, the participants will be blocked until the coordinator recovers, breaking liveness (this is, in fact, the chosen approach in transactional databases where consistency is more important than availability).
4. Participant fails after it receives Commit/Rollback
- Failure at this point has no effect on consensus — all healthy nodes receive the value from the coordinator independently. Of course, failure here would affect the execution phase as the decided action would not be carried out by the failed participant. This, however, cannot be prevented and the only thing we can do is to detect the failure (final ACK message) and run a recovery process (in most cases automatic).
By now you might be asking: why would anybody choose such an imperfect protocol? First of all, thanks to FLP we know that in the real world, with intrinsically asynchronous systems, we will never get a perfectly reliable consensus algorithm. There will always be some kind of risk. As with everything in tech, it comes down to tradeoffs — different solutions yield different characteristics in terms of performance, which properties the algorithm loses, to what degree, and under what circumstances and probabilities. The 2-Phase Commit is a good example of such a tradeoff: in exchange for low message complexity (under ideal circumstances, 3 message delays to commit/rollback) we accept some risk of losing liveness.
With that, we’re ready to move on and complicate things a bit.
So far, we have been dealing solely with crash and omission failures with limited behavior — node is either correct or not responding. Unfortunately, in the real world, we often cannot rely on such predictable behavior. There might be intermittent network outages, hardware malfunctions and software bugs, causing all sorts of issues. What’s even worse, some nodes might be “dishonest” actors — providing the rest of the network with wrong and conflicting information to achieve some nefarious goal.
A natural question arises: can we protect against such non-deterministic “anything-possible” failures, and if yes, how? The answer was presented in a 1982 paper by Lamport, Shostak and Pease on an abstract scenario known as the Byzantine Generals Problem (hence Byzantine failures and Byzantine Fault Tolerance). In this scenario, several divisions of the Byzantine army are camped outside an enemy city, each division commanded by its own general. The generals can communicate only by messenger and must decide on a common plan of action — either attack or retreat. The caveat is that some of the generals might be traitors, trying to prevent the loyal ones from reaching an agreement, resulting in an uncoordinated action and ultimately a defeat.
The generals must have an algorithm to guarantee that:
- all loyal generals obey the same order
- if the commanding general is loyal, then every loyal general obeys the order he sends
If we translate this into the domain of distributed systems: all healthy honest nodes must agree on the same value, proposed by one of the nodes (assuming the proposing node is honest). If the proposing node isn’t honest, the remaining nodes still have to agree on one value, however it might be different from the one proposed — usually, some default value is used (in the Byzantine generals scenario “retreat” is the default value; more on this later).
Before we move any further, here are the assumptions we’ll be working with:
- synchronous model — we can detect a missing message
- oral messages (as opposed to signed messages, on which more later) — the receiver knows who sent the message but cannot guarantee the content hasn’t been altered
Let’s start with a scenario with 3 nodes, having one traitor:
Once the lieutenants receive the order from the commander, they exchange the received values amongst each other to verify their validity and agree on a common action. In the first figure, lieutenant 2 is a traitor and sends a false message to lieutenant 1. Lieutenant 1 thus receives conflicting messages. Based on the stated algorithm requirements, when the commander is loyal, the lieutenants must obey his order — hence lieutenant 1 must choose to attack.
In figure 2, the commander is the traitor, sending different orders to the lieutenants. Lieutenant 1 will end up with the same conflicting messages as in figure 1, unable to differentiate between the two cases. If he chooses to obey the commander, lieutenant 2, who faces the same dilemma, will carry out a different order than lieutenant 1 — consensus will not be reached! If, instead, the lieutenants opt for a default value (retreat) when facing conflicting messages, then the requirement of following a loyal commander’s order will not be met in the case from figure 1!
The previous example shows that the Byzantine generals problem cannot be solved for just 3 nodes. In fact, the paper concludes that Byzantine fault tolerance can be achieved only if fewer than 1/3 of the overall number of generals are traitorous (in other words, to tolerate m dishonest nodes, there must be at least 3m + 1 nodes in total). So, for 1 dishonest node there must be a minimum of 4 nodes in total:
In the above picture, a loyal commander sends value v to the lieutenants. The lieutenants then exchange the received values. Lieutenant 3 is a traitor, sending false value x to lieutenants 1 and 2 (the picture only shows messages for lieutenant 2). Both lieutenants thus receive values (v, v, x). They then simply use a majority function to pick a value, in this case, value v.
Let’s consider a scenario where the commander is traitorous, sending different values to each of his lieutenants:
After the lieutenants exchange the received messages, they each end up with the following values: (x, y, z). No value has a majority. In this case, the lieutenants will use a default value (retreat). Note that if, for example, the commander sent the value x (instead of value y) to lieutenant 2, the lieutenants would agree on value x, as it would obtain the majority. The goal of the solution is for the nodes to agree on one value, not to determine whether a value is true or false, or whether the proposing node is honest or not.
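The lieutenants’ decision rule can be sketched as a majority function with a default value (a minimal sketch; the function name is illustrative, and the `retreat` default follows the scenario above):

```python
from collections import Counter

DEFAULT = "retreat"

def decide(received_values):
    """Pick the strict-majority value; fall back to the default when none exists."""
    counts = Counter(received_values)
    value, count = counts.most_common(1)[0]
    if count > len(received_values) // 2:
        return value
    return DEFAULT

print(decide(["v", "v", "x"]))  # v -- loyal commander, one traitorous lieutenant
print(decide(["x", "y", "z"]))  # retreat -- traitorous commander, no majority
```

Note that the function never asks whether a value is “true”: as the text explains, the goal is only that all honest nodes compute the same result from the same inputs.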
It is important to understand that the solution is recursive. In each iteration, the commander sends the value to the lieutenants. Each lieutenant then acts as the commander in the next iteration, sending the value to the remaining lieutenants, and so on. The number of additional iterations necessary to achieve BFT equals the assumed maximum number of dishonest nodes. In the example above, we assume at most 1 dishonest node, and thus just one additional iteration is required.
This has serious implications on the practicality and performance of the solution:
- you must be conscious of the maximum number of dishonest nodes your system can tolerate
- the higher the maximum number of dishonest nodes, the higher the total number of nodes required to achieve BFT (n ≥ 3m + 1 where n is the total number of nodes, m is the number of dishonest nodes)
- the higher the maximum number of dishonest nodes, the more iterations of the algorithm (and hence messages) required — message complexity is exponential: nᵐ
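The exponential blow-up can be made concrete by counting messages. Assuming the textbook recurrence for the oral-messages algorithm OM(m) — the commander sends n − 1 messages, then each of the n − 1 lieutenants recursively runs OM(m − 1) among the remaining n − 1 generals — the count grows quickly:

```python
def om_messages(n, m):
    """Messages sent by the oral-messages algorithm OM(m) with n generals.

    OM(0): the commander sends n - 1 messages.
    OM(m): n - 1 messages, then each of the n - 1 lieutenants runs
           OM(m - 1) among the remaining n - 1 generals.
    """
    if m == 0:
        return n - 1
    return (n - 1) + (n - 1) * om_messages(n - 1, m - 1)

for m in range(1, 4):
    n = 3 * m + 1  # minimum total nodes tolerating m traitors (n >= 3m + 1)
    print(f"m={m}, n={n}: {om_messages(n, m)} messages")
```

Even at the minimum node counts, going from m = 1 to m = 3 multiplies the message count by several hundred.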
Having an exponential message complexity clearly isn’t very efficient. One way of simplifying the solution and improving the performance is by restricting the nodes’ ability to lie using signed messages:
- any alteration of a signed message can be detected
- a loyal general’s signature cannot be forged
- anyone can verify the authenticity of a general’s signature
With signed messages, a solution for just 3 nodes exists:
The commanding general, who is a traitor, signs conflicting orders and sends them to the lieutenants. Each lieutenant then signs the received order and forwards it to the remaining lieutenants. The difference from the oral messages is that this time the lieutenants know that the commander is the traitor because the original orders were signed by him. They can thus safely choose to retreat.
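The detection step can be sketched as follows. As a stand-in for digital signatures, this toy uses HMACs with keys shared for verification — real systems would use asymmetric signatures, where verifying needs only a public key; all names and keys here are illustrative:

```python
import hashlib
import hmac

# Toy key material; in reality each general would hold a private signing key.
KEYS = {"commander": b"c-secret", "lt1": b"l1-secret", "lt2": b"l2-secret"}

def sign(general, order):
    return hmac.new(KEYS[general], order.encode(), hashlib.sha256).hexdigest()

def verify(general, order, signature):
    return hmac.compare_digest(sign(general, order), signature)

# The traitorous commander signs conflicting orders.
to_lt1 = ("attack", sign("commander", "attack"))
to_lt2 = ("retreat", sign("commander", "retreat"))

# After forwarding, each lieutenant holds both signed orders. Both carry the
# commander's valid signature, which proves he issued conflicting orders --
# so the lieutenants safely fall back to the default (retreat).
order1, sig1 = to_lt1
order2, sig2 = to_lt2
commander_lied = (order1 != order2
                  and verify("commander", order1, sig1)
                  and verify("commander", order2, sig2))
decision = "retreat" if commander_lied else order1
print(decision)  # retreat
```

The signatures are what the oral-messages setting lacked: a lieutenant forwarding an order can no longer alter it without the tampering being detected.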
Without going into the nitty-gritty details of the algorithm, here are its attributes:
- the total number of nodes required is n ≥ m + 2 (the solution is vacuous otherwise — it is pointless trying to reach consensus in a system with just 1 or 0 honest nodes)
- message complexity is polynomial: n²
Keep in mind, however, that even though the message complexity is reduced dramatically, some performance degradation will occur, because we must generate and verify the signatures for every message.
Example: Byzantine 2-Phase Commit
Given the acquired knowledge, can we make the 2-Phase commit algorithm tolerant to Byzantine failures? The answer is yes, although technically it won’t be a “2-Phase” commit anymore, as it requires adding another phase after the third step:
```
Coordinator                                 Participants

1. QUERY TO COMMIT
        -------------------------------->
2. VOTE YES/NO
        <--------------------------------
3. COMMIT/ROLLBACK
        -------------------------------->
4. BROADCAST                 |<->|<->|<->|
5. ACKNOWLEDGMENT
        <--------------------------------
```
In step 4, each participant broadcasts the message received in step 3 to the remaining participants. Once the broadcast is done, the participants will use the majority function (assuming oral messages) to pick the action to execute. The rest of the algorithm continues as in standard 2-Phase commit.
Why do we need the broadcast only after the third step and not after the first two? Remember the goal of the algorithm is to achieve consensus, regardless of whether the value is true or whether it comes from an honest node:
- in step 1, there is no way for the coordinator to lie; the worst thing that can happen is a message omission/crash, which is already handled by the standard 2-Phase commit.
- in step 2, a lying participant might affect the chosen value (responding with No will roll back the transaction), but it cannot prevent the healthy nodes from reaching consensus, as the final value is decided and proposed by the coordinator in step 3.
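Step 4’s decision can be sketched as a participant-side majority vote over the broadcast values (a toy model under oral messages; using `ROLLBACK` as the no-majority default is an assumption of this sketch, mirroring the conservative choice a database would make, not something the protocol above prescribes):

```python
from collections import Counter

def byzantine_decide(broadcast_values, default="ROLLBACK"):
    """After step 4, a participant holds the decision it received in step 3
    plus the decisions the other participants claim to have received.
    Pick the strict-majority value; with no majority, fall back to the default."""
    counts = Counter(broadcast_values)
    value, count = counts.most_common(1)[0]
    return value if count > len(broadcast_values) // 2 else default

# A lying coordinator sent COMMIT to two participants and ROLLBACK to one;
# after the broadcast, every honest participant holds the same multiset of
# values and therefore computes the same decision.
print(byzantine_decide(["COMMIT", "COMMIT", "ROLLBACK"]))  # COMMIT
```

Because every honest participant decides from the same gathered values, a coordinator sending conflicting decisions can no longer split them into a committing group and a rolling-back group.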
That does it for part 1. Part 2 will cover main consensus algorithms in the blockchain, their approach to the FLP problem and Byzantine failures and how they compare to each other in terms of performance and other characteristics. I will place the link here once ready. Stay tuned.