If you are interested in blockchains and cryptocurrencies, you have likely stumbled across Merkle trees (also known as hash trees). It is worth noting that almost every blockchain project uses Merkle trees as part of its protocol. In version 5.0.0 of the Lisk SDK, we are introducing lisk-tree, the new library that will handle Merkle trees in the Lisk protocol.
This blog post is divided into two sections. The first section provides a crash course in Merkle trees, covering the most important concepts, such as Merkle roots and proofs-of-inclusion. In the second section, we introduce lisk-tree, describe how Merkle trees are implemented in Lisk, and finally explain the improvements they bring to the Lisk protocol.
Below you will find the main technical terminology used throughout this blog post, as a reference point to help provide a clear understanding. When not specified otherwise, we assume that each data value is a byte array.
Figure 1: A Merkle tree obtained from the names of the members of the research team. The white boxes represent the original dataset, the green boxes are leaf nodes, and finally the yellow boxes are branch nodes. At the top of the tree sits the Merkle root. Unpaired nodes are pushed to the layer above without hashing them (dashed boxes). Please note that in this blog post we shortened the output of the hash function for ease of reading.
A Merkle tree is an authenticated data structure organized as a tree. Let's try to understand what exactly each of these words means: it is a data structure because it stores and organizes a set of data values; it is organized as a tree because its nodes are connected by parent-child relations, going from the leaf nodes at the bottom up to a single root node at the top; and it is authenticated because it allows anyone to efficiently verify that certain data is part of it, since any change in the underlying dataset results in a different Merkle root.
To build a Merkle tree, proceed as follows: first, hash every data value in the dataset to obtain the leaf nodes; then, hash pairs of adjacent nodes to obtain the branch nodes of the layer above, pushing any unpaired node to the next layer without hashing it; finally, repeat this procedure until only a single node, the Merkle root, is left.
In Fig. 1, an illustrative example of a Merkle tree constructed from the names of the Lisk research team members can be seen.
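To make the construction above concrete, the following is a minimal TypeScript sketch of how a Merkle root could be computed from a dataset. It uses Node's built-in crypto module and plain SHA-256 everywhere; it is only an illustration of the procedure, not the actual lisk-tree implementation (which, as discussed later, uses dedicated leaf and branch hash functions).

import { createHash } from 'crypto';

// SHA-256 of a byte array.
const sha256 = (data: Buffer): Buffer => createHash('sha256').update(data).digest();

// Compute the Merkle root of a non-empty dataset: hash every data value to
// obtain the leaf nodes, then repeatedly hash adjacent pairs of nodes;
// unpaired nodes are pushed to the layer above without hashing (cf. Fig. 1).
const merkleRoot = (dataset: Buffer[]): Buffer => {
  let layer = dataset.map(sha256); // leaf nodes
  while (layer.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i + 1 < layer.length; i += 2) {
      next.push(sha256(Buffer.concat([layer[i], layer[i + 1]])));
    }
    if (layer.length % 2 === 1) {
      next.push(layer[layer.length - 1]); // unpaired node, promoted as-is
    }
    layer = next;
  }
  return layer[0];
};

// Hypothetical dataset of five data blocks, as in Fig. 1.
const dataset = ['D1', 'D2', 'D3', 'D4', 'D5'].map(d => Buffer.from(d, 'utf8'));
console.log(merkleRoot(dataset).toString('hex'));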
Figure 2: The Merkle tree shown in Fig. 1 is used to illustrate the proof-of-inclusion. In order to prove that the data value "Iker" (blue-bordered box) is part of the tree, the Prover sends the hashes in the red-bordered boxes, along with their positions in the tree. The Verifier can use this information to recompute the Merkle root (top blue-bordered box) and check that it corresponds to the same one they have.
One could ask why we go through all the hassle of building the tree, when it would be easier to just hash the original data directly to get a single root. The point is that once the tree is built, it is possible to create short proofs-of-inclusion for any data present in the original dataset. These proofs-of-inclusion are just the set of intermediate hash values needed to recalculate the Merkle root.
There are two parties involved in a proof-of-inclusion: the Verifier and the Prover. The Verifier’s responsibility is to check that certain data is part of the original dataset, while the role of the Prover is to prepare the proof and give it to the Verifier.
We assume that the Verifier knows the Merkle root and (obviously) the data they wish to verify, but not the whole dataset. The Prover, on the other hand, has access to the entire tree (or just to the original dataset, from which they can reconstruct the tree).
The procedure for a proof-of-inclusion is as follows: the Verifier sends the Prover the data they wish to verify; the Prover builds the proof, consisting of the intermediate hashes needed to recompute the Merkle root together with the position of the data in the tree and the length of the dataset, and returns it to the Verifier; finally, the Verifier uses the proof to recalculate the Merkle root and checks that it matches the one they already hold.
An example of a proof-of-inclusion can be seen above in Fig. 2. The Verifier holds the data in the blue-bordered boxes (the data block "Iker" and the Merkle root). They can verify that "Iker" is part of the research team by asking for a proof-of-inclusion, which in this case includes 3 hashes (red-bordered boxes), the index of "Iker" in the tree (2 in this case), and the length of the dataset (5).
Using this information, the Verifier knows that they have to hash "Iker" on the left with the first hash in the proof, then the result has to be hashed on the right with the second proof element, and finally on the left again with the third one.
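As a rough illustration of the Verifier's side, the sketch below recomputes the Merkle root from a data block, its index, the dataset length, and the list of proof hashes (ordered from the bottom of the tree to the top). It reuses the simplified sha256-based construction from the sketch above; the names and signatures are ours, not the lisk-tree API.

// Verify a proof-of-inclusion against a known Merkle root (simplified
// construction, reusing sha256 from the sketch above). The proof hashes
// are ordered from the bottom layer of the tree to the top.
const verifyProof = (
  data: Buffer,
  index: number,
  datasetLength: number,
  proof: Buffer[],
  root: Buffer,
): boolean => {
  let node = sha256(data); // leaf node of the data block to verify
  let idx = index;
  let size = datasetLength;
  let used = 0;
  while (size > 1) {
    if (idx % 2 === 1) {
      node = sha256(Buffer.concat([proof[used++], node])); // sibling on the left
    } else if (idx < size - 1) {
      node = sha256(Buffer.concat([node, proof[used++]])); // sibling on the right
    } // otherwise the node is unpaired and promoted to the layer above
    idx = Math.floor(idx / 2);
    size = Math.ceil(size / 2); // size of the layer above
  }
  return used === proof.length && node.equals(root);
};

For the example of Fig. 2, verifyProof would be called with index 2, dataset length 5, and the three red-bordered hashes, reproducing exactly the left/right hashing order described above.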
Without the Merkle tree construction, the Verifier would need to request all the other 4 elements of the dataset. It doesn't look like we're saving much here (only 3 hashes versus 4).
However, the important point to recognize here is that the size of a proof-of-inclusion only scales logarithmically with the size of the original dataset (rather than linearly).
In the Lisk protocol, a full block can contain at most around 128 transactions. In this case the proof-of-inclusion for a transaction would have length Log 128 = 7 (where Log indicates the logarithm in base 2), rather than 127. This is already a significant saving! In fact, the Verifier can also check the presence of multiple data blocks at once. This can actually save further space in the proof, as some of the required hashes can be calculated directly by the Verifier.
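As a back-of-the-envelope comparison (a simple illustration, not part of the protocol), the maximum proof size grows only with the logarithm of the dataset size:

// Maximum number of hashes in a proof-of-inclusion for a dataset of n blocks,
// compared with the n - 1 data blocks the Verifier would otherwise need.
const maxProofSize = (n: number): number => Math.ceil(Math.log2(n));

for (const n of [8, 128, 1024, 1048576]) {
  console.log(`n = ${n}: at most ${maxProofSize(n)} hashes instead of ${n - 1} data blocks`);
}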
An additional point to mention here is that the original dataset in Fig. 2 is ordered alphabetically. This means that if we give a proof-of-inclusion for two adjacent elements in the dataset, it also constitutes a proof-of-non-inclusion for any data block that would lie in between.
As an example, we can prove that there is no "Gianluigi" in the research team by proving that both "Andreas" and "Iker" are present and are located next to each other in the original dataset (this can be checked using their index).
This is a general property: If we know for sure that the original dataset will always be ordered (according to a certain rule), it is always possible to give a proof-of-non-inclusion using this strategy.
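A hypothetical check along these lines (reusing the verifyProof sketch from above; the interface and names are ours, not part of lisk-tree) could look as follows:

// Hypothetical proof-of-non-inclusion check for an ordered dataset: the value
// is absent if two adjacent dataset elements bracket it and both are proven
// to be included in the tree (verifyProof is the sketch from above).
interface InclusionProof {
  data: Buffer;
  index: number;
  hashes: Buffer[];
}

const verifyNonInclusion = (
  value: Buffer,
  left: InclusionProof,
  right: InclusionProof,
  datasetLength: number,
  root: Buffer,
): boolean =>
  right.index === left.index + 1 && // adjacent in the ordered dataset
  Buffer.compare(left.data, value) < 0 && // left neighbour sorts before the value
  Buffer.compare(value, right.data) < 0 && // the value sorts before the right neighbour
  verifyProof(left.data, left.index, datasetLength, left.hashes, root) &&
  verifyProof(right.data, right.index, datasetLength, right.hashes, root);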
In this section we go over some use cases for Merkle trees in the Lisk protocol, and describe the specifications of lisk-tree. If you would like to see further information, please take a look at the specifications provided in LIP 0031.
Use Cases
Lisk will benefit from Merkle trees in the following places: the transactions root stored in the block header (LIP 0032), the decentralized regenesis (LIP 0035), and potentially further parts of the protocol in the future, such as interoperability.
Technical Specifications
If you made it this far, congratulations! You now know what Merkle trees are and why we are implementing them in the Lisk protocol. This final section gives more insight into some of the technical choices we made. In particular, we cover the concept of leaf and branch hashes and why it is important to define different hashing functions for them. In general, we chose to follow the Certificate Transparency specifications. In the following, hash is the SHA256 hashing function and | indicates byte concatenation.

Leaf versus Branch Hash
Figure 3: (top) The Merkle tree built from the dataset x = [D1, D2, D3, D4]. (bottom) The Merkle tree built from the dataset x' = [D1, D2, D3' = hash(D3)|hash(D4)]. If we use the same hash function for leaf and branch nodes, L3' = B2, and thus both trees have the same root R.

In our construction, Merkle trees accept input datasets of arbitrary length. Due to this, they could potentially be susceptible to second pre-image attacks. In a second pre-image attack, the goal of the attacker is to find a second input x' that hashes to the same value as another given input x, i.e., to find x' ≠ x such that hash(x) = hash(x'). In Merkle trees, this corresponds to having two different input datasets x and x' giving the same Merkle root. But this is easy to do: for any x = [D1, D2, D3, D4], just choose x' = [D1, D2, D3' = hash(D3)|hash(D4)], as shown above in Fig. 3. Notice that L3' equals B2, resulting in the same Merkle root.

To solve this problem, we prepend a different flag to the data before hashing it, depending on whether the resulting node is a leaf or a branch node. In particular, we define the leaf and branch hash functions as follows:
leafHash(data) = hash(00|data)
branchHash(child1, child2) = hash(01|child1|child2)
If we try to use the same trick, this now results in two different roots, as D3 and D4 would be hashed together differently in the two cases:
L3' = leafHash(hash(D3)|hash(D4))
B2 = branchHash(hash(D3), hash(D4)).
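A possible TypeScript rendering of these two functions (a sketch interpreting 00 and 01 as single-byte prefixes, not necessarily the exact lisk-tree code) is:

import { createHash } from 'crypto';

const sha256 = (data: Buffer): Buffer => createHash('sha256').update(data).digest();

// Single-byte flags distinguishing leaf data from branch-node inputs.
const LEAF_PREFIX = Buffer.from([0x00]);
const BRANCH_PREFIX = Buffer.from([0x01]);

// leafHash(data) = hash(00|data)
const leafHash = (data: Buffer): Buffer => sha256(Buffer.concat([LEAF_PREFIX, data]));

// branchHash(child1, child2) = hash(01|child1|child2)
const branchHash = (child1: Buffer, child2: Buffer): Buffer =>
  sha256(Buffer.concat([BRANCH_PREFIX, child1, child2]));

With the prefixes in place, L3' = leafHash(hash(D3)|hash(D4)) and B2 = branchHash(hash(D3), hash(D4)) are hashes of inputs starting with different bytes, so the attack of Fig. 3 no longer produces the same root.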
Appending Data to the Tree
Figure 4: (top) When appending new data to the dataset, we can update the tree using only the hashes in the append path (purple-bordered boxes). (bottom) The updated tree. Nazar is now a member of the research team! The new append path is again indicated by purple-bordered boxes.
The original dataset from which the tree is built can later be expanded by appending new data. In this case, the Merkle root, and in general the whole tree, would in principle have to be recomputed from scratch. However, it turns out that it is possible to append new data to an existing tree efficiently.
What this means is that the number of operations needed to update the tree is only logarithmic in the size of the dataset. To do this, we store an additional array, called the append path. The append path contains at most Log N values, with N the size of the original dataset.
Using the append path, we can append new data to the tree as follows: the new data block is hashed to obtain a new leaf node, this leaf is combined with the hashes stored in the append path to recompute the affected branch nodes and the new Merkle root, and finally the append path itself is updated.
In Fig. 4 (top), the append path is indicated by the purple-bordered boxes. After appending a new data block to the tree, the Merkle root and the append path can be updated just using the old append path, resulting in the new tree in Fig. 4 (bottom). The new append path is indicated by the purple-bordered boxes.
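The sketch below illustrates the idea under the simplified assumption that the append path stores, from the bottom of the tree to the top, the roots of the perfect subtrees that make up the current tree (one per set bit in the binary representation of the dataset size). It reuses the leafHash and branchHash sketches from above; the exact data layout and append algorithm used by lisk-tree are specified in LIP 0031.

// Sketch of appending a data block using only the append path (reusing
// leafHash and branchHash from the previous sketch). The append path holds
// the roots of the perfect subtrees of the current tree, smallest first.
interface TreeState {
  size: number; // number of data blocks in the dataset
  appendPath: Buffer[]; // subtree roots, smallest subtree first
}

const append = (state: TreeState, newData: Buffer): TreeState => {
  let node = leafHash(newData);
  const path = [...state.appendPath];
  let size = state.size;
  // While the lowest bit of the current size is set, the new node merges with
  // the smallest stored subtree, forming a perfect subtree twice as large.
  while (size % 2 === 1) {
    node = branchHash(path.shift()!, node);
    size = Math.floor(size / 2);
  }
  return { size: state.size + 1, appendPath: [node, ...path] };
};

// The Merkle root of a non-empty tree is obtained by folding the append path
// from the bottom (smallest subtree) to the top (largest subtree).
const rootFromAppendPath = (state: TreeState): Buffer =>
  state.appendPath.reduce((acc, subtreeRoot) => branchHash(subtreeRoot, acc));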
The lisk-tree library introduces Merkle trees into the Lisk protocol as specified in LIP 0031. Merkle trees will be used to store the transactions root in the block header (LIP 0032) and to perform the decentralized regenesis (LIP 0035). However, in the future they may be important for other aspects of the protocol such as interoperability.
For further questions and comments on the topic, we will host an AMA on Lisk.chat with Alessandro Ricottone (Research Scientist) next Thursday, September 24th at 4 pm CEST. We also invite all community members to go to the Lisk Research forum where we are always happy to hear your feedback and to participate in discussions on this and other topics.