Merkle trees, also known as , are a type of . They can be used to . They are used in both blockchain and non-blockchain-related projects. binary hash trees binary tree efficiently verify the integrity of large sets of data Merkle trees are build from the bottom up: The leaves of the tree contain the hash of each element in our dataset. This dataset could represent a list of transactions, parts of a file, etc. Each internal node is the result of the concatenation of the hashes of its two children. hashing Outline Hash functions Properties of a Merkle tree What problems can a Merkle tree solve? Implementing a Merkle tree in JavaScript Known vulnerability Sparse Merkle Trees Conclusions Hash functions Hash functions take as input any piece of data (a string, an image, etc.) and produce a fixed-length output. An example of a hash function is SHA-256, which takes data of arbitrary size and produces an output that is 256 bits long. Good hash functions have the following properties: . They produce evenly distributed outputs in the possible output range. In the SHA-256 example, this range would go from 0000...0000 to 1111....1111. Uniformity . The same input will always generate the same output. Deterministic . Fast to compute need additional properties to make them more secure: Cryptographic hash functions . Given an input x, they generate an output h(x). From an output h(x), it is computationally infeasible to find the input it came from, x. One-way functions . The likelihood of 2 different inputs x,y producing the same output h(x) = h(y) should be negligible. Collision resistant . Given an output h(x), it is infeasible to find an input such as h(x) = h(y). Resistant to preimage attacks y A will produce a . This is known as the . minor modification in the input completely different output avalanche effect Properties of a Merkle Tree The properties of a Merkle tree derive from its structure and its use of hash functions: . The root of the tree will change if any node in the tree is modified. Even the same dataset, with its elements presented in a different order (ex: [T2, T1, T3, T4] instead of [T1, T2, T3, T4]) will produce a different tree. Data integrity . We do not need the full tree to verify whether a transaction belongs to it. We only need a path that starts in one of the leaves and goes all the way up to the root (proof of inclusion). Partial verification ceil(log2(N)). The function ceil(x) maps x to the smallest integer greater or equal than x. Ex: ceil(3.14) = 4 and ceil(3) = 3. The height of a tree of N leaves is What problems can a Merkle tree solve? Introduction to Blockchain I will use to illustrate how Merkle trees can verify if a transaction belongs to a block. Bitcoin In case you aren’t familiar with blockchains, we can oversimplify things imagining that a blockchain is a of elements called . contain a list of that stores . Part of this metadata is the root of the block’s Merkle tree, called . distributed linked list blocks Blocks transactions and a header metadata Merkle root , to ensure immutability. If the previous block in the chain changes, even by one bit, its hash would change and it would break the chain. Each block points to the hash of its predecessor Imagine you are running an app that needs to verify if a transaction belongs to a block . In Bitcoin, store a block’s Merkle root along with all its transactions. For them, it is easy to verify if belongs to . Also, they can generate proofs for any transaction in a block on demand. T B full nodes T B What if the app runs on a device with limited disk, computing power, and bandwidth? To determine whether belongs to , the device needs information. It can retrieve this information from other participants in the network. T B Problems The following problems arise: How can we ? Data might be corrupted and other nodes in the network can send fake data. trust that the data we receive from other nodes is correct Given the limited resources, how can we ? minimize the amount of information we need to receive from the network Merkle trees can help to solve these problems. From the properties of Merkle trees we covered above: solves problem 1. Let’s assume we receive different pieces of information from a series of participants in the network. We have , which contains its Merkle root. The data we receive from the network is used to build a Merkle tree whose root must be the same as ‘s Merkle root. Otherwise, we cannot trust the data that others have shared with us. Data integrity B B and having a solve problem 2. Hashing all the transactions would also work to verify data integrity. However, that does not scale well, because proof generation and verification would be O(N) operations. Partial verification height of ceil(log2(N)) commonly known as , do not store full blocks. They interact with the network to request the information they need to verify transactions. They can assess whether a transaction belongs to a block requesting an and the block header, where the is stored. Simplified Payment Verification(SPV) nodes, lightweight nodes or light clients inclusion proof Merkle root Implementing a Merkle Tree in JavaScript The algorithm to generate a Merkle root from a list of transactions is as follows: The leaves of a Merkle tree contain the hashes of the transactions that belong to a block. In the image above, [T1, T2, T3, T4, T5, T6] represent transactions. From them, we generate the leaves of the Merkl tree using a hash function. If the number of transactions is odd, we duplicate the last node. We take 2 consecutive leaves and hash the concatenation of their hashes. This generates their parent. In the image, T1’s and T2’s parent is equal to hash(hash(T1) + hash(T2)). We keep repeating this process with all the transactions at the same “level”. These parents will be siblings that will form a new level in the tree hierarchy. We keep repeating this process until we only have one element left, the root of the tree. You can find my implementation of a Merkle tree. It can verify a Bitcoin block’s Merkle root and generate and verify proofs of inclusion. here Known vulnerability Bitcoin’s Merkle tree duplicates the last node in levels with an odd number of nodes. Also, if Bitcoin finds a block that is not valid, it caches its root to avoid trying to mine it again. This combination makes the tree susceptible to : for an input x, we can find a second input y, different from x, such that h(x) = h(y). second preimage attacks How does this design suffer from this? Let’s go back to our first example, based on the tree represented in the image below: The leaves of the tree come from a block containing these transactions [T1, T2, T3, T4, T5, T6]. B1 The level right above the leaves is [hash(hash(T1) + hash(T2)), hash(hash(T3) + hash(T4)), hash(hash(T5) + hash(T6))]. Since we have an odd number of nodes, we duplicate the last one. Hence, the next level changes to [..., ..., hash(hash(T5) + hash(T6)), hash(hash(T5) + hash(T6))] This is equal to the level produced by a block with the following transactions [T1, T2, T3, T4, T5, T6, ]. B2 T5, T6 contains duplicate transactions, making the block invalid. Nodes would cache this root so that next time they find they discard it. Both B1 and B2 produce the same root. Thus, nodes will reject , even though B1 might be a valid block. B2 a block with this root B1 For more details, have a look at and in Bitcoin’s source code. this post this comment There is another known vulnerability in Bitcoin’s original Merkle trees (already fixed). Understanding it requires more knowledge of how Bitcoin builds transactions. Since this is an introductory article, I have decided to leave it out. However, I might cover transactions in more depth in future articles and revisit this vulnerability. Sparse Merkle Trees A Sparse Merkle Tree (SMT) is a variation of a Merkle Tree, where every leaf node has an index along with its value. All leaves are initialized to an empty (null) value. By design, SMTs have 2^256 leaves. The Merkle tree I described above can verify that a transaction is part of the block by exploring the path from its corresponding leaf to the root of the tree. However, they cannot be used to prove that a transaction does not belong to a block. And in general, to prove that elements do not belong to a dataset. SMTs can do this. SMTs can produce proofs of non-inclusions in an efficient manner. The idea is that we know the index of every element. Thus, you know which leaf contains an element. If an element is not part of the tree, the leaf where the element would be stored will be null. Therefore, to produce a proof of non-inclusion, we produce a proof of inclusion generated from that node initialized to null. Proofs of inclusion work in the same way as Merkle trees’. Updating a leaf an SMT is also efficient because we only need to modify the path that goes from that leaf to the root. Since SMTs have 2^256 leaves, only 256 nodes need modification. SMTs deserve their own article, but I wanted to at least introduce the basic idea behind them. Conclusions Now you should have a clearer idea of what Merkle trees are and what they are used for. If you want to learn more, I suggest you investigate how: Git uses them to keep track of changes in files.DynamoDB uses them to identify any inconsistencies between nodes. I hope you found this article helpful. Share it, because you could help someone pass their exams or get a job. You can access more of my content on and my . my Twitter blog Previously published at https://www.yourdevopsguy.com/merkle-trees-and-bitcoin/