I recently ran across this bloom filter post by Michael Schmatz and it inspired me to write about a neat variation on the bloom filter data structure that I’ve found useful in my work.

Quick refresher: a bloom filter is a probabilistic data structure that tests if an element is potentially in a set. False positives are possible when testing if an element is in the set, but negatives mean an element is definitely not in the set. Got it? Cool.

A counting filter is essentially a bloom filter that’s had its single-bit booleans replaced with n-bit integers. This makes the filter take up much more space than a standard bloom filter, but in return we get an upper-bound count for insertions of a particular element and can remove elements from the filter.

Though removal of elements can be pretty neat, we expose ourselves to the possibility of false negatives if we remove an element that was never inserted into the counting filter. Just be careful with that; Deke Guo has more to say about this than I do.

Implementation

I designed my counting filter class with the following interface:

```cpp
// Adds an element to the counting filter.
void Add(const T *key, int size = sizeof(T));

// Removes an element from the counting filter.
void Remove(const T *key, int size = sizeof(T));

// Test whether the element has been added. If false, the element
// is definitely not in the set. If true, the element could be in
// the set or it is a false positive as described above.
bool MaybeContains(const T *key, int size = sizeof(T)) const;

// Get the upper bound on the number of times an element could
// have been inserted into the counting filter.
int CountUpperBound(const T *key, int size = sizeof(T)) const;
```

The counters are stored in a std::vector like so:

```cpp
std::vector<uint8_t> counters_;
```

I initially implemented this with the counters stored in a std::array object. Using the array, the filter benchmarked at ~1.35x the runtime of a std::unordered_set insertion on my machine.
With the vector, it’s benchmarking at ~2.05x std::unordered_set insertions. I settled on the slower option because the code looks cleaner and it allows us to specify the size and number of hashes in the bloom filter during instantiation without needing to use template arguments (a std::array must have a size known at compile time).

The constructor is pretty simple: an up-front allocation of the counters_ vector via a call to resize(), which also fills the vector with zeros. The more interesting part is how I use hashes to get index pairs:

```cpp
template <typename T, int64_t kSize, int32_t kNumHashPairs>
void CountingFilter<T, kSize, kNumHashPairs>::IdxFromKey(
    const T *key, const int size, const uint32_t seed,
    int64_t *idx1, int64_t *idx2) const {
  // One 128-bit hash is split into two 64-bit halves, one per index.
  std::array<uint64_t, 2> results;
  MurmurHash3_x64_128(key, size, seed, results.data());
  *idx1 = results[0] % counters_.size();
  *idx2 = results[1] % counters_.size();
  assert(*idx1 < counters_.size());
  assert(*idx2 < counters_.size());
}
```

Note that this limits the implementation to an even number of hashes, but I like the fact that we’re getting a 2-for-1 deal, since each 128-bit MurmurHash3 call yields two 64-bit hashes.

The Add() function has to account for the possibility of a count exceeding the maximum counter value. In this case, there’s not much we can do except avoid the increment and flag the filter data as potentially erroneous. For now, I just assert that it doesn’t happen:

```cpp
template <typename T, int64_t kSize, int32_t kNumHashPairs>
void CountingFilter<T, kSize, kNumHashPairs>::Add(const T *key, const int size) {
  for (int32_t xx = 0; xx < kNumHashPairs; ++xx) {
    int64_t idx1, idx2;
    IdxFromKey(key, size, xx, &idx1, &idx2);
    // It's possible that the count can exceed the maximum uint8_t
    // value. After many removals, this could result in a false
    // negative, but this is very unlikely. Let's just assert for
    // this case. The comparison is strict because a counter that is
    // already at the maximum would wrap to zero on increment.
    assert(counters_[idx1] < std::numeric_limits<uint8_t>::max());
    assert(counters_[idx2] < std::numeric_limits<uint8_t>::max());
    counters_[idx1] += 1;
    counters_[idx2] += 1;
  }
  ++num_insertions_;
}
```

Removal is much simpler since we just decrement the counters. It’s up to you whether to assert that the counters are nonzero prior to decrementing them:

```cpp
template <typename T, int64_t kSize, int32_t kNumHashPairs>
void CountingFilter<T, kSize, kNumHashPairs>::Remove(const T *key, const int size) {
  for (int32_t xx = 0; xx < kNumHashPairs; ++xx) {
    int64_t idx1, idx2;
    IdxFromKey(key, size, xx, &idx1, &idx2);
    assert(counters_[idx1] > 0);
    assert(counters_[idx2] > 0);
    counters_[idx1] -= 1;
    counters_[idx2] -= 1;
  }
  --num_insertions_;
}
```

Checking whether an element exists in the filter is just a check that all of its counters are nonzero. Getting an upper bound on the number of times an element has been inserted into the filter only requires finding the minimum of its counters. This is an upper bound because collisions with other elements can inflate the counters.

```cpp
template <typename T, int64_t kSize, int32_t kNumHashPairs>
int CountingFilter<T, kSize, kNumHashPairs>::CountUpperBound(const T *key, const int size) const {
  // Start above any value a uint8_t counter can hold.
  int count_ub = std::numeric_limits<uint8_t>::max() + 1;
  for (int32_t xx = 0; xx < kNumHashPairs; ++xx) {
    int64_t idx1, idx2;
    IdxFromKey(key, size, xx, &idx1, &idx2);
    count_ub = std::min(static_cast<int>(counters_[idx1]), count_ub);
    count_ub = std::min(static_cast<int>(counters_[idx2]), count_ub);
  }
  return count_ub;
}
```

Wrapping up

This data structure is a fun variation on a vanilla bloom filter. My implementation is currently benchmarking at ~2.05x the insertion time of a pre-reserved std::unordered_set. I’ve mainly used it to be pickier about cache insertions when the working set of data is very large, but I’d love to find more applications for this.
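The membership check is only described in prose above, so here is a minimal, self-contained sketch of how it could look next to the other operations. It is not the class from this post: SketchCountingFilter and its Index helper are hypothetical names, keys are assumed to be strings, and std::hash with the seed appended to the key stands in for MurmurHash3_x64_128. It also saturates counters instead of asserting, which is just one possible way to handle overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical, simplified counting filter for std::string keys.
class SketchCountingFilter {
 public:
  SketchCountingFilter(std::size_t num_counters, int num_hashes)
      : counters_(num_counters, 0), num_hashes_(num_hashes) {
    assert(num_counters > 0 && num_hashes > 0);
  }

  void Add(const std::string &key) {
    for (int seed = 0; seed < num_hashes_; ++seed) {
      std::uint8_t &c = counters_[Index(key, seed)];
      // Saturate at the maximum instead of wrapping around to zero.
      if (c < kMax) ++c;
    }
  }

  void Remove(const std::string &key) {
    for (int seed = 0; seed < num_hashes_; ++seed) {
      std::uint8_t &c = counters_[Index(key, seed)];
      // Skip empty counters, and leave saturated counters alone
      // because their true count is no longer known.
      if (c > 0 && c < kMax) --c;
    }
  }

  // False means "definitely not present"; true means "possibly present".
  bool MaybeContains(const std::string &key) const {
    for (int seed = 0; seed < num_hashes_; ++seed) {
      if (counters_[Index(key, seed)] == 0) return false;
    }
    return true;
  }

  // The smallest counter bounds the number of insertions from above.
  int CountUpperBound(const std::string &key) const {
    int count_ub = kMax;
    for (int seed = 0; seed < num_hashes_; ++seed) {
      count_ub = std::min(count_ub,
                          static_cast<int>(counters_[Index(key, seed)]));
    }
    return count_ub;
  }

 private:
  static constexpr std::uint8_t kMax =
      std::numeric_limits<std::uint8_t>::max();

  // Assumption: appending the seed to the key is a stand-in for a
  // properly seeded hash function such as MurmurHash3.
  std::size_t Index(const std::string &key, int seed) const {
    const std::string salted = key + '#' + std::to_string(seed);
    return std::hash<std::string>{}(salted) % counters_.size();
  }

  std::vector<std::uint8_t> counters_;
  int num_hashes_;
};
```

With a filter built as SketchCountingFilter filter(1024, 4), a key that was added twice reports a CountUpperBound of at least 2, and removing it twice makes MaybeContains return false again as long as no counter has saturated.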

Counting Bloom Filter in C++
