A Merkle tree is a binary tree that makes it possible to efficiently and securely verify the contents of large data structures. The concept of this tree was proposed and patented in 1982 by Ralph Merkle, an American cryptographer.
The length of the hash remains the same for any amount of input data. Leaf vertices contain hashes from blocks of data, whereas inner vertices include hashes from the addition of values in two child vertices. In turn, the root vertex comprises the hash of the entire data set.
Initially, there is a set of data that is not related to each other in any way. For each block of data (in the context of a transaction blockchain), we need to compute a hash of the block. The number of data blocks should be 2^N
, because as the tree is built, the previous blocks are hashed in pairs at each iteration. A pair of hashes is then grouped together and a new hash is created based on them. Hashing will continue as long as there is something to group at each level, and it will stop when there is only one pair left from which the root hash – the Merkle root – is obtained.
H(1-2) = keccak256(H(1) + H(2))
H(3-4) = keccak256(H(3) + H(4))
H(5-6) = keccak256(H(5) + H(6))
H(7-8) = keccak256(H(7) + H(8))
H(1-2-3-4) = keccak256(H(1-2) + H(3-4))
H(5-6-7-8) = keccak256(H(5-6) + H(7-8))
H(1-2-3-4-5-6-7-8) = keccak256(H(1-2-3-4) + H(5-6-7-8))
Once we have the root, we can validate the data. If the root hash matches the original hash, everything is fine, the data is valid. It is also possible to efficiently check whether certain data is present in the tree structure.
Let’s say we want to check that the H2
hash is present in the tree. The proof will be an array of data: H1
, H3-4
, H5-6-7-8
. Thanks to these hashes, we can reconstruct the root hash:
H(1-2-3-4-5-6-7-8) = keccak256(keccak256(keccak256(H(1) + H(2)) + H(3-4)) + H(5-6-7-8))
Then we compare the resulting hash with the original one. If they match, it means that the hash H2
is present in the tree.
Let’s now see how to implement a Merkle tree and data validation in Solidity.
For simplicity, we will write the array of transactions directly to the blockchain, as well as the array of hashes. Since we will be using the built-in keccak256
function, which returns a hash of 32 bytes, the datatype of the hashes array will be bytes32
. The array will also have a dynamic length. In general, the length of the hashes array will not be the same as the length of the transactions array, because it will hold the hashes produced. We could calculate the length in advance, but for the sake of code simplification we won’t. If you like, you can make each of these arrays public
and read the hashes of the transactions.
We will produce hashes
in the constructor. The first step is to count the hashes for blocks with transactions
in the loop. Then, at each level of the tree, the number of hashes is reduced by a factor of 2. Since we will be storing the hashes in the hashes array, the index offsets must be stored separately. The iterator in the inner loop is incremented by 2 because we will be taking a couple of elements when calculating and adding hashes. In abi.encodePacked
this time we will add 2 hashes each to which we will set the indices correctly: for the left element – the value of the current shift offset + the index at the current tree level, and for the right element – the same + 1.
So, we have built the tree, but the interest consists in validating the data. To do this, let’s write a validateTransaction
function that takes a transaction and its index. Unfortunately, Solidity does not have a find
method for arrays, so it is easier to just pass the transaction index.
First, let’s calculate the hash of the transaction and create an array of proofs. We find the number of proofs depending on the length of the transaction array and set the length of proofsIndexes
. After it, at each level of the tree, we find the necessary element, depending on the index, taking the right or left element, bearing in mind the offsets in the hashes
. Then we go through the array of proof indices, checking the index for parity and calculating the hash depending on whether the element is left or right in the pair. Finally, we compare the resulting hash with the root hash and return the result.
There are several applications of a Merkle tree in blockchain.
Without Merkle trees, blockchains would be too cumbersome, and Merkle proofs allow for cost-effective verification of data authenticity.
Merkle trees are actively used in blockchains (e.g., Bitcoin, Ethereum) to store transactions efficiently, but this is not the only application. For example, after the collapse of the FTX exchange, many exchanges implemented a Proof-of-Reserve mechanism, which utilizes a Merkle tree.