Probabilistic Data Structures And Algorithms In Big Data

Written by physboy | Published 2021/11/27
Tech Story Tags: big-data | data-structures | algorithms | probabilistic-data-structures | big-data-processing | big-data-top-story | big-data-applications | big-data-analytics

TLDR: Probabilistic data structures are a great fit for modern Big Data applications. They use hash functions to randomize items and keep their size constant. The Bloom filter, invented by Burton Bloom in 1970, is an implementation of a probabilistic set. Typical operations include identifying unique or frequent items, and using more hash functions generally makes the answers more accurate. Bloom filters offer a powerful combination of simplicity and versatility: in layman’s terms, they support operations similar to hash tables but use less space. Storage systems such as Apache Cassandra use these structures to handle massive amounts of information.

Shocker: Big Data derives its name not just from the size.

The datasets of Big Data are larger and more complex, fetched from new data sources. These massive volumes of data can be used to address business problems more intelligently. However, traditional data processing algorithms fall short when handling data at this magnitude.

Deterministic data structures like HashSet do the trick with smaller amounts of data. But when we have to deal with something like streaming applications, those structures cannot process everything in a single pass or support incremental updates.

That is why we need more space-efficient and fast algorithms. Thus, probabilistic data structures are a great fit for modern Big Data applications.

With that said, let’s have a look at probabilistic data structures and algorithms as well as their common use.

Deterministic Vs Probabilistic Data Structure

Deterministic data structures are familiar to any techie: we routinely bump into Array, List, HashTable, HashSet, and so on. They support a wide variety of operations, including insert, find, and delete (provided you have specific key values), and each of those operations returns a deterministic, exact result.
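For reference, here is what those exact operations look like in Python with the built-in set (the Python counterpart of HashSet); every query returns a precise answer:

# Exact answers from a deterministic structure: Python's built-in set
seen = set()
seen.add("user-42")          # insert
print("user-42" in seen)     # find: True, always exact
seen.discard("user-42")      # delete
print("user-42" in seen)     # False, no approximation involved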

Probabilistic data structures, however, live up to their name. They cannot give you a definite answer; instead, they give you a reasonable approximation of the answer along with a way to estimate how far off that approximation might be.

How do they work?

These data structures are a great fit for large data sets. The most prominent operations include identifying unique or frequent items. To answer such queries, probabilistic data structures use hash functions to randomize and compactly represent the items.

Because they tolerate hash collisions rather than resolving them, they keep their size constant. Yet this is also why they cannot give you exact values. Within limits, the more independent hash functions you use, the more accurate the answers get.

The main use cases of probabilistic data structures include:

  1. Huge datasets.
  2. Statistical analysis.
  3. Mining terabytes of data, and others.

Examples of probabilistic data structures are as follows:

  1. Membership query (Bloom filter, counting Bloom filter, quotient filter, cuckoo filter).
  2. Cardinality estimation (linear counting, probabilistic counting, LogLog, HyperLogLog, HyperLogLog++).
  3. Frequency (count sketch, count-min sketch).
  4. Similarity (LSH, MinHash, SimHash), and others.

Let’s have a look at the most widely used data structures within this realm.

Bloom Filter

The Bloom filter is an implementation of a probabilistic set, invented by Burton Bloom in 1970. This approximate membership query structure allows you to compactly store elements and check whether a given element belongs to the set.

In this case, you can get a false positive (the element is not in the set, but the data structure says it is), but never a false negative. A Bloom filter can use any amount of memory predefined by the user, and the larger it is, the lower the probability of a false positive.

The operation of adding new elements to the set is supported. However, you can’t delete the existing ones.

Bloom filter allows you to perform three kinds of operations:

  • add an item to the set
  • check whether the element belongs to the set
  • check whether the element does not belong to the set

When the structure flags the element as Found/Present, there is a small chance that it’s lying. But if we’re talking about the Not Found/Not Present category, the Bloom filter boasts 100% accuracy plus space-saving perks.

Its hype can be attributed to the fact that Bloom filters have this powerful combo of simplicity and multi-purpose nature. In layman’s terms, they support operations similar to the hash tables but use less space.
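To make that concrete, below is a minimal Bloom filter sketch in Python. The bit-array size, the number of hash functions, and the salted SHA-256 hashing are illustrative choices rather than a reference implementation:

import hashlib

class BloomFilter:
    # A minimal Bloom filter: num_bits bits, num_hashes hash functions derived from salted SHA-256.
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive the bit positions by hashing the item with different salts.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True
print(bf.might_contain("bob"))    # False, or rarely a false positive

Note the asymmetry described above: a False answer from might_contain is always trustworthy, while a True answer is only probably correct.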

Apache Cassandra, for example, benefits from these structures to process massive amounts of information. This storage system taps into Bloom filters to find out whether an SSTable has data for a specific partition.

HyperLogLog

HyperLogLog is a beautiful, yet simple algorithm for handling cardinality estimation. It excels when dealing with sets of data with a huge number of values.

Let’s say you have a massive dataset of elements with duplicate entries, drawn from a set of cardinality n, and you are required to find n, the number of unique elements in the set.

This can be helpful, for example, when counting the number of distinct Google searches performed by end-users in a day. If you tried to squeeze all the data into memory, you would need storage proportional to the number of Google searches per day.

Instead, the HyperLogLog data structure hashes each element and keeps only a tiny summary of those hash values from which the cardinality can be estimated, allowing it to solve the problem with as little as 1.5 kB of RAM.

HyperLogLog produces an approximate count of distinct elements; query engines such as Presto expose this estimation through functions like APPROX_DISTINCT.

Operations:

  • Insert: add an element to the data structure
  • Merge: generate a structure which is a combination of two structures
  • Cardinality: estimate the number of distinct elements inserted
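The simplified HyperLogLog sketch below, written in Python, illustrates all three operations. It keeps 2**p registers, records the position of the leading 1-bit of a 64-bit hash, and omits the small- and large-range corrections of the full algorithm, so the precision parameter p and the SHA-256 hashing are illustrative assumptions:

import hashlib

class HyperLogLog:
    # Simplified HyperLogLog with 2**p registers (no small/large-range corrections).
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias-correction constant

    def insert(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1 # position of the leading 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        # Normalized harmonic mean of the register values.
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog()
for i in range(100_000):
    hll.insert(f"user-{i}")
print(round(hll.cardinality()))  # roughly 100,000, typically within a few percent

Because merging two sketches built on different machines is just a register-wise maximum, the structure is particularly convenient for distributed counting.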

Count-min Sketch

The count-min sketch is another efficient algorithm for counting streaming data. This data structure boils down to keeping track of the counts of things. By running this algorithm, you can find out approximately how many times an element occurs in the set, and you can also easily test whether a given element has been observed before.

Just like Bloom filters, the count-min sketch saves a lot of space by using probabilistic techniques. The counting mechanism relies on several hash functions, each mapping an element to a counter in its own row of a small table.

Overall, the count-min structure works great whenever you’re looking for just approximate counts of the most essential elements.

Operations:

  • Insert: add an element to the data structure
  • Merge: generate a structure which is a combination of two structures
  • Estimate: approximate the number of times a specific element was inserted
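The minimal count-min sketch below, again in Python, follows the same pattern; the width, depth, and salted SHA-256 row hashes are illustrative assumptions rather than tuned values:

import hashlib

class CountMinSketch:
    # A minimal count-min sketch: depth rows of width counters, one hash function per row.
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def insert(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Taking the minimum across rows limits the damage done by collisions.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def merge(self, other):
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [a + b for a, b in zip(self.table[row], other.table[row])]
        return merged

cms = CountMinSketch()
for word in ["big", "data", "big", "big", "sketch"]:
    cms.insert(word)
print(cms.estimate("big"))   # 3, or slightly more if collisions occur
print(cms.estimate("none"))  # 0 unless a collision inflates it

Because counters are only ever incremented, the reported count can overestimate the truth when collisions occur, but it never underestimates it.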

As for the prominent applications, AT&T leverages the structure in network switches to analyze traffic in memory-constrained environments. The structure is also implemented as part of Twitter's Algebird library.

To Sum It Up

Working with Big Data is a challenge in itself, let alone digging up answers from it. In such cases, the best you can count on is an approximate answer. Probabilistic data structures allow you to conquer the beast and give you an estimated view of some data characteristics. You lose some accuracy in the results but save an enormous amount of storage space in exchange.


Written by physboy | Data engineer, python tutor
Published by HackerNoon on 2021/11/27