Nowadays, almost every project collects analytics data, which makes it easier to understand users and their needs. One of the everyday tasks in this area is counting unique visits to web pages.
Let’s imagine that a popular media resource is being developed. The website receives approximately 500 million unique visitors per day. The task is to cache the number of visits to each page in Redis, with writes and reads as fast as possible, and to obtain aggregate statistics across multiple pages. Each unique visit is identified by an IP address.
At first glance, you could use the built-in Set data structure in Redis. A Set is a collection of unique values with useful commands for computing intersections:
127.0.0.1:6379> sadd page:1 113.145.236.211 159.54.101.236 207.47.30.26
(integer) 3
127.0.0.1:6379> sadd page:2 113.145.236.211 36.186.119.48
(integer) 2
127.0.0.1:6379> sinter page:1 page:2
1) "113.145.236.211"
The Set data structure seems to be an excellent solution for this case, but it is not. Redis Sets are practical only for small or medium projects. With 500 million visits per day, the resource is under high load. A Set stores every IP address it receives, so memory grows linearly with the number of visitors and a lot of RAM is needed. Operations over Sets with millions of items also take Redis a long time.
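A back-of-the-envelope calculation gives a feel for the scale. The sketch below is only an estimate of the payload: it assumes IPv4 addresses, and it ignores the per-element overhead that Redis adds on top, so the real footprint would be larger. The constant and function names are illustrative.

```python
UNIQUE_VISITORS_PER_DAY = 500_000_000  # figure from the scenario above
PACKED_IPV4_BYTES = 4                  # assumption: raw 32-bit address
MAX_TEXT_IPV4_BYTES = 15               # "255.255.255.255" stored as text


def gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


# Payload only; Redis bookkeeping per element is deliberately left out.
packed_total = UNIQUE_VISITORS_PER_DAY * PACKED_IPV4_BYTES
text_total = UNIQUE_VISITORS_PER_DAY * MAX_TEXT_IPV4_BYTES

print(f"packed addresses:  {gib(packed_total):.2f} GiB")
print(f"longest text form: {gib(text_total):.2f} GiB")
```

Even the optimistic figure is measured in gigabytes per day, and it is for a single copy of each address. An address that visits several pages is stored once per page Set.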
Fortunately, Redis has the HyperLogLog data structure, which counts many unique events while using a small, bounded amount of memory (up to 12 KB per key). The trade-off is that HyperLogLog is a probabilistic structure: the count it returns is an approximation with a standard error of 0.81%.
To write data to HyperLogLog, use the pfadd key [element [element ...]] command:
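To make the idea of a probabilistic counter more concrete, here is a minimal sketch of a HyperLogLog in Python. It is an illustration only and not the Redis implementation: the class name ToyHyperLogLog, the SHA-1 based hash, and the method names are assumptions chosen for readability. With 16384 registers, the textbook standard error 1.04 / sqrt(m) works out to roughly 0.81%.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Illustrative HyperLogLog; NOT the Redis implementation."""

    def __init__(self, precision=14):
        self.p = precision              # bits used to pick a register
        self.m = 1 << precision         # 2**14 = 16384 registers
        self.registers = [0] * self.m

    def add(self, value):
        """Return 1 if a register changed, otherwise 0."""
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash (assumed choice)
        index = h >> (64 - self.p)                 # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: position of the first 1 bit in `rest`, counted from the left.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return 1
        return 0

    def count(self):
        """Estimate the number of distinct values added so far."""
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return round(m * math.log(m / zeros))
        return round(raw)

    def merge(self, *others):
        """Union with other sketches: keep the maximum of each register."""
        for other in others:
            self.registers = [
                max(a, b) for a, b in zip(self.registers, other.registers)
            ]

    def standard_error(self):
        return 1.04 / math.sqrt(self.m)


hll = ToyHyperLogLog()
for i in range(100_000):
    hll.add(f"visitor-{i}")             # hypothetical visitor identifiers

print("estimate:", hll.count())
print(f"standard error: {hll.standard_error():.2%}")
```

The sketch never stores the values themselves, only one small number per register, which is why memory stays fixed no matter how many visitors are added. The price is that the individual addresses cannot be listed back.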
127.0.0.1:6379> pfadd page:1 158.58.0.86 148.240.139.178 74.81.90.212 33.244.76.56 23.83.156.65
(integer) 1
127.0.0.1:6379> pfadd page:2 41.64.240.230 243.171.182.196 74.81.90.212 33.244.76.56 23.83.156.65
(integer) 1
127.0.0.1:6379> pfadd page:3 158.58.0.86 148.240.139.178 74.81.90.212 225.109.160.131 85.83.185.103
(integer) 1
The command returns 1 if the internal state of the HyperLogLog was changed, which is what normally happens when new values are written. If you insert a value that has already been counted, 0 is returned:
127.0.0.1:6379> pfadd page:1 158.58.0.86
(integer) 0
To get the number of unique visitors, use the pfcount key [key ...] command:
127.0.0.1:6379> pfcount page:1
(integer) 5
127.0.0.1:6379> pfcount page:2
(integer) 5
127.0.0.1:6379> pfcount page:3
(integer) 5
You can calculate the number of unique visitors across several pages with the pfmerge destkey sourcekey [sourcekey ...] command:
127.0.0.1:6379> pfmerge pages page:1 page:2 page:3
OK
127.0.0.1:6379> pfcount pages
(integer) 9
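The pfmerge command merges several HyperLogLog keys into a single one. Here, the merge result is stored in the pages key. If you do not need to keep the merged key, pfcount page:1 page:2 page:3 returns the same combined estimate directly.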
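For a data set this small, the combined count can be cross-checked with exact arithmetic. The sketch below uses plain Python sets (not Redis) with the same addresses as in the pfadd examples above; the variable names are illustrative.

```python
# The same IP addresses that were written with pfadd above.
page_1 = {"158.58.0.86", "148.240.139.178", "74.81.90.212",
          "33.244.76.56", "23.83.156.65"}
page_2 = {"41.64.240.230", "243.171.182.196", "74.81.90.212",
          "33.244.76.56", "23.83.156.65"}
page_3 = {"158.58.0.86", "148.240.139.178", "74.81.90.212",
          "225.109.160.131", "85.83.185.103"}

# A merged HyperLogLog estimates the size of this union.
all_pages = page_1 | page_2 | page_3

print(len(page_1), len(page_2), len(page_3))  # 5 5 5
print(len(all_pages))                         # 9
```

Addresses that visited more than one page are counted once in the union, which is why the result is 9 rather than 15.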
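To summarize: the pfadd command adds elements to a HyperLogLog, the pfcount command can calculate the HyperLogLog cardinality, and the pfmerge command merges several HyperLogLog keys into one.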