Bloom Filters - Power in Simplicity

I first heard of Bloom filters from my professor at IIT Madras when talking about caching. I was young and new to the world of computer science then, but even so I remember being fascinated by how simple yet powerful bloom filters are. After close to 9 years, I realize that this simple yet powerful data structure is still relegated to the side when someone thinks of mainstream data structures. This article is to correct that.

What are Bloom Filters?

Bloom filters are a probabilistic data structure used for membership checks. They give true negatives and probable positives i.e when I ask a bloom filter if an element exists in it, its answer would be ‘definitely not’ or ‘maybe yes’. The advantage is they do this in near-constant time using every little extra space. Curious, isnt it? I will now explain how a bloom filter works and you will understand the nature of their answers.

A bloom filter has 2 components:

k hash functions: A hash function is just a function that takes a string and outputs a mathematical ‘fingerprint’ of it.
bit array of size N: Nothing fancy, just N bits.

Adding an element

When we add an element E to the bloom filter, we do the following:

hash E with each of the k hash functions and generate k hashes h1, h2…hk.
Find the reminder when each of these hashes is divided by N. Let these be r1, r2, …, rk.
Turn on the bits corresponding to r1, r2 … rk in the bloom filter’s bit array.

If this is too complex to follow, let me explain the addition with a simple example using a small bloom filter with 2 hash functions and just 16 bits (2 bytes) storage.

Example: Add

Bloom Filter construction:

2 hash functions: Here, I am going to use some imaginary function that give me the values I need to explain.
16 bits storage, all set to 0 initially. The bits are indexed 0 to 15.
Step 1: Add ‘ganesh’

hash1(‘ganesh’) = 2

hash2(‘ganesh’) = 7

So we set the the 2nd and 7th bits in our bitarray to 1.

Step 2: Add ‘palanikumar’

hash1(‘palanikumar’) = 5

hash2(‘palanikumar’) = 12

So, we set the 5th and 12th bits in our bitarray to 1.

After adding both these elements, the bits 2, 7, 5, 12 are set to 1.

Membership check

To check if an element Q is present in a filter, we do the following:

Find the indexes that correspond to Q using the same method described above.
If any one of these indices are 0, that means Q is definitely not in the set. If the element had been added to the set, every one of these bits must have been turned on.
If all of them are 1, it means Q ‘may be’ in the set. Why the may be? It is possible that these bits were turned on at separate times because of multiple other elements.

Example: Check

Following the same example above, let us check for ‘animal’.

hash1(‘animal’) = 6

hash2(‘animal’) = 2

But in our bitarray, index 6 is not set to 1. So, we can say that ‘animal’ definitely does not exist in the set.
Now let us check for '‘lottery’

hash1(‘lottery’) = 2

hash2(‘lottery’) = 12

In our bitarray, index 2 and 12 are indeed set to 1. But we know that ‘lottery’ is not in the set since we inserted only 2 words ‘ganesh’ and ‘palanikumar’. This is an exmaple of the bloom filter returning a false positive.

Why use Bloom Filters ?

Bloom filters are very useful as a data structure to represent sets in memory-constrained environments. They are often used in cache firmware to decide if an object is present in the cache or not before trying to access the object. Database engines like Postgres use Bloom Filters to reduce disk reads. Disk reads are the slowest operations in most systems so reducing them helps to significantly speed up queries.

Bloom Filter vs Set: Evaluationg Time and Space

To analyze the speed vs memory footprint of Bloom filters, I prepared a small script in Python.

Step 1: Define the Bloom Filter

from bitarray import bitarray

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def _hash(self, item, seed):
        return hash(item + str(seed)) % self.size

    def add(self, item):
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            self.bit_array[index] = 1

    def check(self, item):
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if not self.bit_array[index]:
                return False
        return True

Step 2: Write the test script

import random
import string
import time
from bloom_filter import BloomFilter


def generate_random_string(length):
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

# Parameters
num_items = 1000000
num_checks = 10000
bloom_size = 10000000
num_hashes = 7

# Create data structures
bloom_filter = BloomFilter(bloom_size, num_hashes)
standard_set = set()

# Generate random items and add to both structures
print("Adding items...")
bloom_add_time = 0
set_add_time = 0
for _ in range(num_items):
    item = generate_random_string(10)
    start_time = time.time()
    bloom_filter.add(item)
    bloom_add_time += time.time() - start_time
    start_time = time.time()
    standard_set.add(item)
    set_add_time += time.time() - start_time
print(f"Bloom Filter add time: {bloom_add_time/num_items:.6f} seconds")
print(f"Standard Set add time: {set_add_time/num_items:.6f} seconds")

# Perform membership checks
print("\nPerforming membership checks...")
bloom_time = 0
set_time = 0

for _ in range(num_checks):
    item = generate_random_string(10)
    
    start = time.time()
    bloom_result = bloom_filter.check(item)
    bloom_time += time.time() - start
    
    start = time.time()
    set_result = item in standard_set
    set_time += time.time() - start

print(f"Bloom Filter average check time: {bloom_time/num_checks:.6f} seconds")
print(f"Standard Set average check time: {set_time/num_checks:.6f} seconds")

# Calculate false positive rate
false_positives = sum(1 for _ in range(num_checks) if bloom_filter.check(generate_random_string(10)))
false_positive_rate = false_positives / num_checks

print(f"\nFalse Positive Rate: {false_positive_rate:.2%}")

# Memory usage comparison
import sys
bloom_memory = sys.getsizeof(bloom_filter.bit_array)
set_memory = sys.getsizeof(standard_set)

print(f"\nBloom Filter memory usage: {bloom_memory/1024:.2f} KB")
print(f"Standard Set memory usage: {set_memory/1024:.2f} KB")
print(f"Memory saved: {(set_memory - bloom_memory)/1024:.2f} KB")

Output

Adding items...
Bloom Filter add time: 0.000004 seconds
Standard Set add time: 0.000000 seconds

Performing membership checks...
Bloom Filter average check time: 0.000002 seconds
Standard Set average check time: 0.000000 seconds

False Positive Rate: 0.84%

Bloom Filter memory usage: 1220.78 KB
Standard Set memory usage: 32768.21 KB
Memory saved: 31547.43 KB

Analysis

As you can see, there is a false positive rate of < 1%. The membership checks are also slightly less performant. But the bloom filter uses just 3% of the space used by the set. In an execution environment where memory is at a premium, Bloom filters provide huge savings in space with moderate reduction in performance.

Conclusion

Bloom filters are an efficient data structure for use in memory-starved environments. They have a lot of practical applications and are even used in popular tools like CDNs and database engines. Do you have any interesting use cases in which bloom filters would be helpful? Do let me know in comments!