In this article, I implement a cuckoo filter and test its efficiency against a bloom filter. (Read the previous post on Chord DHT, implementing a distributed hash table in Golang.)

Introduction

Probabilistic data structures are very useful, especially when processing large data sets. Most of the time, while working on the data side of things, one wants to run a simple “is the item not present” or “is the item already present” query while processing real-time data. Say you want to answer queries in real time, like the number of unique IPs, the most frequent IPs, or whether an ad has already been served to a user: probabilistic data structures provide a space-efficient way to answer these queries. The typical approach to such queries would be to use either a HashMap or a HashTable, or to store the data in some external cache (like Redis), but the problem is that with large datasets these simple data structures can’t fit into memory. This is where probabilistic data structures come into play, because of their space and time advantages.

Example Use cases

Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.

Medium uses Bloom filters to check if an article has already been recommended to a user.

Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.

The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).

What’s in a “Cuckoo”?

We’ve used bloom filters in many places for answering such queries on the data platform. Recently I came across this paper on the Cuckoo filter, which caught my interest.
The title itself says “Cuckoo Filter: Practically Better Than Bloom”, so I decided to check it out.

Cuckoo filters improve upon the design of the bloom filter by offering deletion, limited counting, and a bounded false positive probability, while still maintaining a similar space complexity. They use cuckoo hashing to resolve collisions and are essentially a compact cuckoo hash table.

Cuckoo and bloom filters are both useful for set membership testing when the size of the original data is large. Both need only a handful of bits per entry (roughly 7 at a false positive rate of about 3%). They are also useful when an expensive operation can be avoided prior to execution by a set membership test. For example, before querying a database, a set membership test can be done to see if the desired object is even in the database.

Algorithm

Parameters of the Filter:
1. Two Hash Functions: h1 and h2
2. An array B with n buckets. The i-th bucket will be called B[i]

Input: L, a list of elements to be inserted into the cuckoo filter.

Algorithm:
While L is not empty:
    Let x be the first item in the list L. Remove x from the list.
    If B[h1(x)] is empty:
        place x in B[h1(x)]
    Else, If B[h2(x)] is empty:
        place x in B[h2(x)]
    Else:
        Let y be the element in B[h2(x)].
        Prepend y to L
        place x in B[h2(x)]

Implementation

The implementation seems pretty straightforward, so I decided to have a go at it and compare how space/time efficient it is compared to a bloom filter. The Cuckoo filter consists of a Cuckoo hash table that stores the ‘fingerprints’ of the items inserted. The fingerprint of an item is a bit string derived from the hash of that item. A cuckoo hash table consists of an array of buckets, where an item to be inserted is mapped to two possible buckets based on two hash functions. Each bucket can be configured to store a variable number of fingerprints. Typically, a Cuckoo filter is described by the number of candidate buckets per item and the bucket size. For example, in the paper’s notation a (2,4) Cuckoo filter gives each item 2 candidate buckets, and each bucket in the Cuckoo hash table can store up to 4 fingerprints.

Insertion

Algorithm:
f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);

if bucket[i1] or bucket[i2] has an empty entry then
   add f to that bucket;
   return Done;

// must relocate existing items;
i = randomly pick i1 or i2;
for n = 0; n < MaxNumKicks; n++ do
   randomly select an entry e from bucket[i];
   swap f and the fingerprint stored in entry e;
   i = i ⊕ hash(f);
   if bucket[i] has an empty entry then
      add f to bucket[i];
      return Done;

// Hashtable is considered full;
return Failure;

Code: https://medium.com/media/fb67f33fdc8334fa7f9ec0af7bb85fce/href

Search

Algorithm:
f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);

if bucket[i1] or bucket[i2] has f then
    return True;

return False;

Code: https://medium.com/media/90d845abd04c0e11f41bbe113395645f/href

Delete

Algorithm:
f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);

if bucket[i1] or bucket[i2] has f then
   remove a copy of f from this bucket;
   return True;

return False;

Code: https://medium.com/media/31a0be9921f81e3f26d1eb425c147a3d/href

Performance Test

I’ve used Will Fitzgerald’s library for the test on the Bloom filter. The FPP (False Positive Probability) ratio taken for the cuckoo filter is 0.001.

https://medium.com/media/eb93ffc65a9a4878e55c27b47d12597d/href

Space Complexity

The cuckoo and bloom filters perform differently at different false positive probabilities. When the false positive probability of the filter is less than or equal to 3%, the cuckoo filter has fewer bits per entry. When it is higher, the bloom filter has fewer bits per entry.

Time Complexity

In cuckoo hashing, inserting an element can be much worse than O(1) in the worst case, because during a collision there can be many rounds in which we have to evict an existing value to make room for the current one. Plus, if there is a cycle, the entire table must be rehashed. Doing a time analysis of both filters yields the following results:

https://medium.com/media/8991ffe2b9d120643326957ae8d09a50/href

Throughout this experiment (keeping in mind my code may not be fully optimized), Bloom filters seem to perform exceptionally well in space complexity, occupying less space for a large number of items. The Cuckoo filter seems to perform better when inserting a large number of items, but is a little slower in lookups (search times) due to its implementation.

Inference

I wouldn’t really take a side on which filter to recommend; I think they both have their own use cases. Bloom filters do not support deletions because hashing is lossy and irreversible. Though counting bloom filters solve that problem, Cuckoo filters are useful in the case where you would require deletions.
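To make the deletion point concrete, here is a minimal Go sketch of the insert, search and delete pseudocode from the Implementation section. It is not the code from the arriqaaq/cuckoo repository: the names (`CuckooFilter`, `NewCuckooFilter`, `hashParts`, `altIndex`, `swapIn`), the 8-bit fingerprints, the FNV hash, the 4-entry buckets, the 500-kick limit and the power-of-two bucket count are all assumptions chosen to keep the example short.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const (
	bucketSize  = 4   // fingerprints per bucket (assumed)
	maxNumKicks = 500 // relocation attempts before reporting "full" (assumed)
)

// bucket holds 8-bit fingerprints; the value 0 marks an empty entry.
type bucket [bucketSize]uint8

// CuckooFilter is a toy filter for illustration only.
type CuckooFilter struct {
	buckets []bucket
	mask    uint32 // numBuckets-1; numBuckets must be a power of two
}

func NewCuckooFilter(numBuckets uint32) *CuckooFilter {
	return &CuckooFilter{buckets: make([]bucket, numBuckets), mask: numBuckets - 1}
}

// hashParts takes the first bucket index from the low bits of one FNV-1a
// hash and a non-zero fingerprint (1..255) from its high bits.
func (cf *CuckooFilter) hashParts(item []byte) (uint32, uint8) {
	h := fnv.New64a()
	h.Write(item)
	sum := h.Sum64()
	return uint32(sum) & cf.mask, uint8((sum>>32)%255) + 1
}

// altIndex computes i XOR hash(f); applying it twice returns the original index.
func (cf *CuckooFilter) altIndex(i uint32, fp uint8) uint32 {
	h := fnv.New32a()
	h.Write([]byte{fp})
	return (i ^ h.Sum32()) & cf.mask
}

// swapIn overwrites the first entry of bucket i equal to old with val.
// swapIn(i, 0, fp) fills an empty entry; swapIn(i, fp, 0) removes one copy of fp.
func (cf *CuckooFilter) swapIn(i uint32, old, val uint8) bool {
	for s, e := range cf.buckets[i] {
		if e == old {
			cf.buckets[i][s] = val
			return true
		}
	}
	return false
}

// Insert returns false when the filter is considered full. As in the
// pseudocode, the fingerprint still being relocated at that point is dropped.
func (cf *CuckooFilter) Insert(item []byte) bool {
	i1, fp := cf.hashParts(item)
	i2 := cf.altIndex(i1, fp)
	if cf.swapIn(i1, 0, fp) || cf.swapIn(i2, 0, fp) {
		return true
	}
	// Both candidate buckets are full: evict a random entry and relocate it.
	i := i1
	if rand.Intn(2) == 0 {
		i = i2
	}
	for n := 0; n < maxNumKicks; n++ {
		s := rand.Intn(bucketSize)
		fp, cf.buckets[i][s] = cf.buckets[i][s], fp
		i = cf.altIndex(i, fp)
		if cf.swapIn(i, 0, fp) {
			return true
		}
	}
	return false
}

// Lookup reports whether either candidate bucket holds the item's fingerprint.
func (cf *CuckooFilter) Lookup(item []byte) bool {
	i1, fp := cf.hashParts(item)
	for _, i := range []uint32{i1, cf.altIndex(i1, fp)} {
		for _, e := range cf.buckets[i] {
			if e == fp {
				return true
			}
		}
	}
	return false
}

// Delete removes one copy of the item's fingerprint, if present.
func (cf *CuckooFilter) Delete(item []byte) bool {
	i1, fp := cf.hashParts(item)
	return cf.swapIn(i1, fp, 0) || cf.swapIn(cf.altIndex(i1, fp), fp, 0)
}

func main() {
	cf := NewCuckooFilter(1024)
	ip := []byte("10.0.0.1")
	fmt.Println(cf.Insert(ip), cf.Lookup(ip)) // true true
	fmt.Println(cf.Delete(ip), cf.Lookup(ip)) // true false
}
```

Because only fingerprints are stored, deleting an item that was never inserted can remove another item's matching fingerprint, so a filter like this should only be asked to delete items it actually holds.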
Of course, Cuckoo filters give an error when the filter is full, and that has its own advantages, whereas in a Bloom filter there is no control over capacity; it just keeps setting bits in the existing bit array, so the false positive rate climbs as more items are added.

Code

arriqaaq/cuckoo

References

https://brilliant.org/wiki/cuckoo-filter/
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
https://en.wikipedia.org/wiki/Cuckoo_hashing
https://blog.fastforwardlabs.com/2016/11/23/probabilistic-data-structure-showdown-cuckoo.html

P.S. If you find anything wrong with the tests/implementation, please feel free to leave your suggestions/comments.

CUCKOO vs BLOOM filter, from a Gopher’s perspective
