Bloom filters are a data structure developed by Burton Howard Bloom in 1970. You can see them as a hash tables’ cousin.
They also allow for efficient insert and lookup operations while occupying very little space.
The price to pay is false positives.
The probability of a false positive is a function of the size of the filter and the number of hash functions, as we will see in the next section.
In contrast with hash tables, Bloom filters do not store objects. They only remember whether an object has been added to the filter. In this regard, they behave more like sets. Furthermore, they do not allow deletions.
After introducing the ideas behind a Bloom FIlter and some example applications, I will focus on how they are used by Bitcoin light clients.
Outline
You can implement a Bloom filter as an array of n bits. You also need k hash functions. I will explain the
Insert
and Lookup
operations and then give you an example where we will insert elements into a filter, perform lookups and examine how false positives can occur in a Bloom Filter.Insertion
To add a new element to the filter F, you iterate through the k hash functions. For each of them, you hash the element, obtaining a number n in the set {0, 1, 2, ..., n-1} , corresponding to a position in the array. You set F[n] = 1 and move to the next hash function.
Lookup
Similarly, you iterate through all the hash functions. If an element is in the set, the output of all the hash functions will be indices where the bit is set to 1. If any of the indices is set to 0, the element is not part of the filter.
Example
Consider the scenario depicted in the image above. The filter is an array of 10 elements and we are using two hash functions, h1 and h2. We initialize the array to .
To check if a word is part of the filter, we check the entries corresponding to the output of all hash functions. For Hello, the bits 0 and 3 are set to 1. For World, bits 8 and 9 are set to 1. Both strings are part of the filter.
Suppose we want to see if Dog is in the filter and h1("Dog") = 1 and h2("Dog") = 2. Both entries are 0. Therefore, the element is not in the set.
Now, let's lookup Cat. Imagine that h1("Cat") = 0 and h2("Cat") = 1. One of the elements is 0. If Cat was in the filter, filter[h1(Cat)] would be 1. Therefore, Cat is not in the filter.
Now, let's imagine h1(Bird) = 0 and b2(Bird) = 9. In our filter, both entries evaluate to 1, but we never added $ Bird $ to the set. This is an example of a false positive.
Let’s analyze the probability of a false positive.
Let’s assume we have a dataset S, k hash functions, and a filter F of n bits. After inserting all the elements in S, the probability that any bit is still set to 0 is
(1 - 1/n)^{|S|k}
where |S| represents the cardinality of the set (the number of elements in S). Therefore, the probability of a given bit is 1 after all elements have been inserted in F is
1 - (1 - 1/n)^{k|S|
We can approximate this using the exponential function since tends to 0 as n increases
1 - (1 - 1/n)^{k|S|} ~ 1 - e^{-k|S|/n} = 1 - e^{-k/b}
where b = n/|S| is the number of bits per object.
Let’s take an element X that does not belong to the filter. Let’s also assume that each hash function will produce a different output for X. For X to be a false positive, its k bits must be set to one. Thus, the probability of a false positive is
P_{fp} = (1 - e^{-k/b})^k
We can fix b and compute the number of hash functions that minimize this probability. We differentiate the previous equation, make it equal to 0 and solve for k. As a result, we obtain
k = ln(2) * b
Thus, the probability of false positives as a function of the number of bits per object is
P_{fp} ~ (1/2) ^ {ln(2)*b}
We can also fix this probability of false positives (for instance, to 1%) and solve for
b = - ln(P_{fp})/ln(2)^2
If we wish to insert elements, the optimal size of the filter is
n = - ln(P_{fp})/ln(2)^2 * |S|
Before diving into Bitcoin, let’s consider some other applications of Bloom Filters.
Spell checker
This was one of its earliest applications. You could create a filter and insert all words in a dictionary. When you want to check if a word is correctly spelled, check if it is in the filter. If it is not, you know it is not properly spelled.
If the word belongs to the filter, there is a tiny probability that it is misspelled (false positive).
Reduce access to disk
Imagine you have a huge userbase, like Twitter. When a new user signs up, he has to choose a Twitter handle.
Looking up in the database if the handle is available is slower than using memory. We could add all existing Twitter handles to a hash set, but that would take a lot of space. Since we can afford false positives, we can give Bloom Filters a try.
Firstly, you insert into the Bloom filter the handles that are already in use. Over time, you can add more as users sign up. When a client tries to register a certain handle, you look it up in the Bloom filter. If you get a “False” result, you can let the user pick that handle.
If you get a “True” result, either the handle is unavailable or we have a false positive. In this situation, we can check in the database where we store all handles in use.
As a result, Bloom Filters would reduce the number of lookups in the database.
Elements cannot be deleted from a Bloom Filter, only inserted. Therefore, if closing an account makes the handle available again, we need to periodically recreate the Filter.
Reduce network calls
We can apply similar ideas to a system that needs to send network calls to another system. For example, Alexa uses blacklists to ensure it does not show certain types of information. Imagine a service B that keeps a database of items to be blacklisted. Service A makes lots of calls to service B.
To reduce the number of calls, service A could use a Bloom Filter, adding all entities from the blacklist. A negative result from the lookup means the element is not on the blacklist and therefore safe to process. A positive result means that the item might be on the blacklist. To avoid false positives, we can make a call to service B.
The Bloom Filter needs to be synced periodically with the database in service B.
In conclusion, using a Bloom Filter reduces the number of network calls. On the other hand, it adds complexity. Nonetheless, depending on the requirements, it could be a trade-off worth considering.
A step further
To sum up, Bloom Filters are a good solution for applications that:
Otherwise, hash tables or hash sets might be a better alternative. However, what if you could leverage false positives?
In my previous article about Merkle trees, I introduced the concept of light nodes. Light nodes only download block headers and query other nodes for a Merkle path to verify transactions.
Imagine a light node that wants to receive transactions to go to addresses contained in its wallet. The node could talk to full nodes asking for transactions of interest. However, the rest of the nodes would know what addresses this light node is interested in, which is a concern from a privacy perspective. How can Bloom filters help with this?
Entering Bloom Filters
Bloom filters are a data structure that allows for a probabilistic check of membership. You cannot tell with 100% probability if an element present in a Bloom filter is really there or a false positive. Therefore, light nodes can use them to ask their peers for transactions, without revealing exactly what information the light node is interested in.
Light nodes will create a Bloom filter and insert into it the addresses that its wallets contain. When a light client connects to a full node, it sends this Bloom filter. In turn, whenever the full node receives a transaction, it checks if its input or output addresses match the filter. Peers will only send transactions that match the filter.
To be more precise, they send the block header and the Merkle path of that transaction to the root of the tree. Using the Merkle path, the node can verify that the transaction belongs to the block. Use the block header, the node can link this block to the rest of the blockchain. All this using an infinitesimal fraction of the space that the full blockchain requires.
A light node will discard false positives and use the rest transactions to update its UTXO set. UTXO stands for Unspent Transaction Output and the UTXO set is “the way Bitcoin keeps track of who owns what”. I will cover this concept in more detail in an upcoming article about blocks and transactions.
Privacy and Security
Using Bloom Filters, a light node does not need to reveal to the rest of the participants which addresses it is interested in. This would not be a security leak per se, but a privacy leak that could end up compromising security.
For more information, have a look at this paper.
Hash Functions
Instead of using k different hash functions, Bitcoin uses only one, murmur3, with different seeds. Murmur3 is not a cryptographic hash function. This implies that it does not have the same properties as, let’s say SHA256, but it is faster which makes it ideal for this use case.
The following equation defines the ” k different hash functions”:
hash_i=i*0xFBA4C795 + nTweak, for i in {0, 1, ..., k-1}
where nTweak is a random constant.
You can choose two parameters to define your Bloom Filter:
which are familiar to you now from the previous sections.
I hope now you have a better understanding of Bloom Filters, their design trade-offs, and how they can be used in different contexts. As a next step, I recommend you check Bitcoin’s source code and BIP0037, which defines the specifications for Bloom Filters in light nodes.
I hope you found this article helpful. Share it, because you could help someone pass their exams or get a job.
You can access more of my content on my Twitter and my blog.
Previously published at https://www.yourdevopsguy.com/bloom-filters-and-bitcoin/