Nowadays, almost every project collects analytics data, which makes it easier to understand users and their needs. One of the everyday tasks in this area is counting unique visits to web pages.
Let’s imagine that a popular media resource is being developed, with traffic of approximately 500 million unique visitors per day. The task is to cache the number of visits to each page in Redis, with writes and reads as fast as possible and the ability to obtain aggregate statistics across multiple pages. Each unique visit is identified by an IP address.
At first, you might reach for the built-in Sets data structure in Redis. A Set is a collection of unique values that comes with useful operations, such as computing the intersection of several sets.
127.0.0.1:6379> sadd page:1 126.96.36.199 188.8.131.52 184.108.40.206
(integer) 3
127.0.0.1:6379> sadd page:2 220.127.116.11 18.104.22.168
(integer) 2
127.0.0.1:6379> sinter page:1 page:2
1) "22.214.171.124"
The Sets data structure seems to be an excellent solution for this case, but it is not. Redis Sets are practical only for small or medium projects. A Set stores every value it contains, so its memory footprint grows linearly with the number of visitors. With 500 million visits per day, keeping all the data in Sets requires a lot of RAM, and set operations over millions of items take Redis a long time to process.
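To get a feel for the scale, here is a rough back-of-the-envelope estimate. The figure of 64 bytes per set member is an assumption chosen purely for illustration (the real cost depends on the Redis version, the encoding, and the allocator), and estimate_set_memory is a hypothetical helper, not part of Redis.

```python
def estimate_set_memory(members: int, bytes_per_member: int = 64) -> int:
    """Rough memory estimate for one Redis Set, in bytes.

    bytes_per_member is an ASSUMED average cost per stored value
    (the value itself plus hash-table overhead). Measure on a real
    instance before relying on it.
    """
    return members * bytes_per_member


daily_visitors = 500_000_000  # traffic figure from the scenario above
set_bytes = estimate_set_memory(daily_visitors)
print(f"One set holding every daily visitor: ~{set_bytes / 10**9:.0f} GB")
```

Under this assumption, a single set containing all daily visitors already needs about 32 GB, and every page keeps its own set. On a running instance, the MEMORY USAGE command reports the actual size of a key.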
Fortunately, Redis has the HyperLogLog data structure for counting large numbers of unique events, and each key takes up a small, bounded amount of memory (at most about 12 KB). The trade-off is that HyperLogLog is a probabilistic structure: the count it returns is an approximation with a standard error of 0.81%, so it is a typical deviation rather than a hard upper bound.
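The two numbers usually quoted for Redis HyperLogLog can be reproduced with a little arithmetic. The sketch below assumes the commonly documented layout of 16384 registers of 6 bits each and the standard HyperLogLog error formula 1.04 / sqrt(m); the function names are illustrative only.

```python
import math

REGISTERS = 2 ** 14      # 16384 registers per HyperLogLog key
BITS_PER_REGISTER = 6    # each register stores a small counter


def hll_standard_error(registers: int = REGISTERS) -> float:
    """Standard error of the HyperLogLog estimate: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(registers)


def hll_dense_size_bytes(registers: int = REGISTERS,
                         bits: int = BITS_PER_REGISTER) -> int:
    """Size of the register array in bytes (a small header comes on top)."""
    return registers * bits // 8


def typical_deviation(true_count: int) -> float:
    """One standard error expressed in absolute visitors."""
    return true_count * hll_standard_error()


print(f"standard error: {hll_standard_error():.4%}")
print(f"register array: {hll_dense_size_bytes()} bytes")
print(f"typical deviation at 500M visitors: {typical_deviation(500_000_000):,.0f}")
```

For the 500 million visitors in this scenario, one standard error corresponds to roughly 4 million visitors, which is usually acceptable for traffic statistics but not for billing-grade counts.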
To write data to a HyperLogLog, use the pfadd key [element [element ...]] command:
127.0.0.1:6379> pfadd page:1 126.96.36.199 188.8.131.52 184.108.40.206 220.127.116.11 18.104.22.168
(integer) 1
127.0.0.1:6379> pfadd page:2 22.214.171.124 243.171.182.196 126.96.36.199 188.8.131.52 184.108.40.206
(integer) 1
127.0.0.1:6379> pfadd page:3 220.127.116.11 18.104.22.168 22.214.171.124 126.96.36.199 188.8.131.52
(integer) 1
The command returns 1 if the write changed the internal state of the HyperLogLog, which is what happens when a new value is added. If you try to insert a value that is already there, nothing changes and 0 is returned:
127.0.0.1:6379> pfadd page:1 184.108.40.206
(integer) 0
To get the number of unique visitors, use the pfcount key [key ...] command:
127.0.0.1:6379> pfcount page:1
(integer) 5
127.0.0.1:6379> pfcount page:2
(integer) 5
127.0.0.1:6379> pfcount page:3
(integer) 5
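In application code, the same two commands can be wrapped in small helpers. The sketch below is only an illustration: page_key, record_visit, and unique_visitors are hypothetical names, and client is assumed to be any object exposing pfadd(name, *values) and pfcount(*names), as the redis-py client does. When pfcount receives several keys, it returns the approximate cardinality of their union.

```python
def page_key(page_id: int) -> str:
    """Build the key name used in the examples above, e.g. 'page:1'."""
    return f"page:{page_id}"


def record_visit(client, page_id: int, ip: str) -> bool:
    """Register a visit; True if the HyperLogLog was modified by this IP."""
    return client.pfadd(page_key(page_id), ip) == 1


def unique_visitors(client, *page_ids: int) -> int:
    """Approximate unique visitors of one page, or of several pages combined."""
    return client.pfcount(*(page_key(p) for p in page_ids))


# Usage, assuming the redis-py package and a local Redis server:
#   import redis
#   client = redis.Redis(host="127.0.0.1", port=6379)
#   record_visit(client, 1, "203.0.113.7")
#   unique_visitors(client, 1)
```

Passing the client in as a parameter keeps the helpers easy to test with a stand-in object and avoids tying them to one connection setup.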
You can calculate the number of unique visitors across several pages with the pfmerge destkey sourcekey [sourcekey ...] command:
127.0.0.1:6379> pfmerge pages page:1 page:2 page:3
OK
127.0.0.1:6379> pfcount pages
(integer) 9
The pfmerge command merges several HyperLogLog keys into a single one. The merge result is stored in the destination key (pages in the example above), and the pfcount command can then calculate the cardinality of the merged HyperLogLog.
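To show why merging works and why adding an existing value returns 0, here is a toy HyperLogLog in pure Python. It is a simplified sketch and not the Redis implementation: the ToyHLL class, the use of SHA-1 as the hash function, and the basic estimator with a linear-counting correction are all assumptions made for illustration. The idea is that each value is hashed to one of 16384 registers, each register keeps the largest "rank" it has seen, and merging takes the register-wise maximum.

```python
import hashlib
import math


class ToyHLL:
    P = 14          # index bits: 2**14 = 16384 registers
    M = 1 << P

    def __init__(self):
        self.registers = [0] * self.M

    def add(self, item: str) -> int:
        """Add a value; return 1 if a register changed, else 0."""
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.P)                     # first 14 bits pick a register
        rest = h & ((1 << (64 - self.P)) - 1)        # remaining 50 bits
        rank = (64 - self.P) - rest.bit_length() + 1 # position of the first 1 bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank
            return 1
        return 0

    def count(self) -> int:
        """Approximate number of distinct values added so far."""
        m = self.M
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return round(m * math.log(m / zeros))
        return round(raw)

    def merge(self, *others: "ToyHLL") -> None:
        """Fold other sketches into this one via register-wise maximum."""
        for other in others:
            self.registers = [max(a, b)
                              for a, b in zip(self.registers, other.registers)]


page1, page2, pages = ToyHLL(), ToyHLL(), ToyHLL()
for i in range(6000):
    page1.add(f"visitor-{i}")
for i in range(3000, 9000):
    page2.add(f"visitor-{i}")
pages.merge(page1, page2)
print(page1.count(), page2.count(), pages.count())
```

Because a register only ever keeps a maximum, visitors seen on both pages are not counted twice after the merge, and the merged sketch is the same size as each of its inputs.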