How to use Redis HyperLogLog by@vgukasov


Vladislav Gukasov

Senior SWE at Akma Trading

Nowadays, almost every project collects analytics data, because it makes it easier to understand users and their needs. One of the everyday tasks in this area is counting unique visits to web pages.

Let’s imagine we are developing a popular media site that receives approximately 500 million unique visitors per day. The task is to keep the number of unique visits to each page in Redis, with writes and reads that are as fast as possible and the ability to obtain combined statistics for several pages. Each unique visit is identified by an IP address.


Redis Sets

A first idea is to use the built-in Sets data structure in Redis. A Set is an unordered collection of unique values, and Redis provides commands to compute intersections and unions of Sets.

127.0.0.1:6379> sadd page:1 113.145.236.211 159.54.101.236 207.47.30.26
(integer) 3
127.0.0.1:6379> sadd page:2 113.145.236.211 36.186.119.48
(integer) 2
127.0.0.1:6379> sinter page:1 page:2
1) "113.145.236.211"

The Sets data structure seems to be an excellent solution for the case, but it is not. Sets work well only in small or medium projects. A Set stores every member, so its memory grows linearly with the number of unique visitors, and at 500 million visits per day that requires a lot of RAM. Commands such as sinter also have to walk through the members, so processing Sets with millions of items takes a long time.
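A rough back-of-the-envelope calculation shows the scale of the problem. The sketch below is a lower bound only: it assumes every visitor is an IPv4 address and counts just the 4 bytes of the packed address, ignoring the text encoding and all per-member overhead inside Redis, so real usage would be noticeably higher. The function name is hypothetical, and the 500 million figure comes from the scenario above.

```python
# Lower-bound estimate of the raw data a Set must hold for one day of traffic.
# Assumptions (illustrative, not measured): IPv4 only, 4 bytes per address,
# no Redis overhead counted at all.
UNIQUE_VISITORS_PER_DAY = 500_000_000  # from the scenario in this article
BYTES_PER_IPV4 = 4                     # packed form; the text form is 7 to 15 bytes


def min_set_payload_bytes(visitors: int, bytes_per_member: int = BYTES_PER_IPV4) -> int:
    """Return the bare minimum payload size, in bytes, for `visitors` members."""
    return visitors * bytes_per_member


payload = min_set_payload_bytes(UNIQUE_VISITORS_PER_DAY)
print(f"{payload / 10**9:.1f} GB of raw addresses, before any overhead")
# prints "2.0 GB of raw addresses, before any overhead"
```

This is the cost of a single Set holding every visitor once. With one Set per page, the same address is stored again for every page it visits.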

Redis HyperLogLog

Fortunately, Redis has the HyperLogLog data structure, which counts unique events using a small, bounded amount of memory (at most about 12 KB per key) no matter how many elements are added. The trade-off is that HyperLogLog is a probabilistic structure: it does not store the elements themselves and returns an estimate of their number rather than an exact count, with a standard error of 0.81%.
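The 0.81% figure follows from the usual HyperLogLog accuracy formula, in which the standard error is roughly 1.04 divided by the square root of the number of registers. The short calculation below assumes the 16384 six-bit registers that the Redis implementation is documented to use; the function names are hypothetical.

```python
import math

REGISTERS = 16384        # 2**14 registers per HyperLogLog (assumed, per Redis docs)
BITS_PER_REGISTER = 6    # each register holds a small counter


def hll_standard_error(registers: int) -> float:
    """Approximate standard error of a HyperLogLog estimate: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(registers)


def hll_dense_bytes(registers: int, bits: int = BITS_PER_REGISTER) -> int:
    """Size of the register array in bytes, excluding any header."""
    return registers * bits // 8


print(f"standard error: {hll_standard_error(REGISTERS) * 100:.4f}%")
print(f"register array: {hll_dense_bytes(REGISTERS)} bytes")  # 12288 bytes
```

A standard error is not a hard limit: most estimates land within it, but an individual count can be off by more.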

Data writing

To write data to HyperLogLog, use the pfadd key [element [element ...]] command:

127.0.0.1:6379> pfadd page:1 158.58.0.86 148.240.139.178 74.81.90.212 33.244.76.56 23.83.156.65
(integer) 1
127.0.0.1:6379> pfadd page:2 41.64.240.230 243.171.182.196 74.81.90.212 33.244.76.56 23.83.156.65
(integer) 1
127.0.0.1:6379> pfadd page:3 158.58.0.86 148.240.139.178 74.81.90.212 225.109.160.131 85.83.185.103
(integer) 1

pfadd returns 1 if the internal state of the HyperLogLog changed, which is what normally happens when at least one new value is added. If every value has already been counted, nothing changes and 0 is returned:

127.0.0.1:6379> pfadd page:1 158.58.0.86
(integer) 0

Data reading

To get the number of unique visitors, use the pfcount key [key ...] command:

127.0.0.1:6379> pfcount page:1
(integer) 5
127.0.0.1:6379> pfcount page:2
(integer) 5
127.0.0.1:6379> pfcount page:3
(integer) 5
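In application code, the same two commands are usually issued through a client library. The sketch below assumes a client object that exposes `pfadd` and `pfcount` methods with the same argument order as the Redis commands, which is how the redis-py client names them. The helper names `record_visit` and `unique_visitors` are hypothetical, and the `page:<id>` key scheme simply mirrors the examples above.

```python
from typing import Protocol


class HyperLogLogClient(Protocol):
    """Minimal interface this sketch relies on (redis-py's client matches it)."""

    def pfadd(self, name: str, *values: str) -> int: ...
    def pfcount(self, *sources: str) -> int: ...


def page_key(page_id: int) -> str:
    """Build the key used in this article's examples, e.g. 'page:1'."""
    return f"page:{page_id}"


def record_visit(client: HyperLogLogClient, page_id: int, ip: str) -> bool:
    """Add one visitor IP; True means the HyperLogLog state changed."""
    return client.pfadd(page_key(page_id), ip) == 1


def unique_visitors(client: HyperLogLogClient, *page_ids: int) -> int:
    """Estimate unique visitors for one page, or the union of several pages."""
    if not page_ids:
        raise ValueError("at least one page id is required")
    return client.pfcount(*(page_key(p) for p in page_ids))
```

With redis-py installed and a server running locally, something like `redis.Redis(host="127.0.0.1", port=6379)` could be passed as `client`. Taking the client as a parameter keeps the helpers easy to test without a live server.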

You can calculate the number of unique visitors across several pages by merging their HyperLogLogs with the pfmerge destkey sourcekey [sourcekey ...] command and then counting the result:

127.0.0.1:6379> pfmerge pages page:1 page:2 page:3
OK

127.0.0.1:6379> pfcount pages
(integer) 9

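The pfmerge command merges several HyperLogLog keys into a single one that represents the union of their elements. Here the merge result is stored in the pages key. If you do not need to keep the merged result, passing several keys to pfcount returns the same union estimate without storing it.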
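To see why merging is cheap and why adding a value a second time returns 0, it can help to look at a toy version of the data structure. The following is a simplified sketch of the HyperLogLog idea, not the Redis implementation: the class name `ToyHyperLogLog`, the SHA-1 hash, and the small-range correction are choices made for illustration only.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Illustrative HyperLogLog; Redis's real encoding and hash differ."""

    def __init__(self, precision: int = 14):
        self.p = precision
        self.m = 1 << precision           # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def _hash64(self, value: str) -> int:
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value: str) -> int:
        """Return 1 if a register changed, else 0 (like the pfadd reply)."""
        h = self._hash64(value)
        index = h >> (64 - self.p)                # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: position of the first 1-bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return 1
        return 0

    def count(self) -> int:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            estimate = m * math.log(m / zeros)    # linear counting for small sets
        return round(estimate)

    def merge(self, *others: "ToyHyperLogLog") -> None:
        """Union: keep the larger value of each register."""
        for other in others:
            if other.m != self.m:
                raise ValueError("precision mismatch")
            self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]


# The three pages from the examples above.
page1 = ToyHyperLogLog()
for ip in ["158.58.0.86", "148.240.139.178", "74.81.90.212", "33.244.76.56", "23.83.156.65"]:
    page1.add(ip)
page2 = ToyHyperLogLog()
for ip in ["41.64.240.230", "243.171.182.196", "74.81.90.212", "33.244.76.56", "23.83.156.65"]:
    page2.add(ip)
page3 = ToyHyperLogLog()
for ip in ["158.58.0.86", "148.240.139.178", "74.81.90.212", "225.109.160.131", "85.83.185.103"]:
    page3.add(ip)

pages = ToyHyperLogLog()
pages.merge(page1, page2, page3)
print(pages.count())                # an estimate close to 9
print(page1.add("158.58.0.86"))     # 0: this value changes no register
```

Because a merge only takes the maximum of each pair of registers, its cost depends on the number of registers, not on how many visitors each page has seen.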

Conclusion

  • Redis Sets can count unique events exactly, but they are practical only when the amount of data is small or medium.
  • Redis HyperLogLog is a probabilistic data structure that efficiently stores and estimates a large number of unique events.
  • To add data to a HyperLogLog, use the pfadd command.
  • The pfcount command returns the estimated cardinality of a HyperLogLog.
  • You can merge multiple HyperLogLog structures into a single one with pfmerge.
