I’m designing an algorithm to count unique users on a set of pages, based on a 60-minute sliding window.
So it needs to find the unique IPs (or tokens) that have hit a particular page and total up those hits within the last 60 minutes.
I need this to be very fast at scale (mainly writes; fast reads are a bonus). We could have tens of thousands of users per page, multiplied by thousands of pages.
My research is pointing me to using Redis with HyperLogLog
I’m new to Redis, coming from a Memcached background. Could anyone give me any pointers?
Thanks
2 Answers
One way of doing this would be to keep an HLL key for each page (or set of pages) at minute resolution. For example, if we’re tracking ‘index.html’ and the current timestamp is 0, a visitor with the ID ‘abc’ can be tracked with `PFADD index.html:0 abc`. Once that minute has passed – i.e. timestamp 1 for simplicity – a visitor such as ‘def’ is added to the next key with `PFADD index.html:1 def`.
And so forth. To count the number of unique visitors over the last 60 minutes, assuming the current timestamp is 100, you’ll need to call the `PFCOUNT` command and provide it with the names of all 60 of these keys, e.g. `PFCOUNT index.html:41 index.html:42 ... index.html:100`. Note: if you want “old” counts to be evicted, call `EXPIRE` after each call to `PFADD`.
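The windowed count can be sketched as follows, again assuming redis-py and the same illustrative `page:minute` key layout; the helper names and the 61-minute TTL are assumptions, not prescribed by the answer:

```python
def window_keys(page: str, now_minute: int, window: int = 60) -> list[str]:
    """Keys covering the last `window` minutes, oldest first.
    With now_minute=100 and window=60 this spans minutes 41..100."""
    start = max(now_minute - window + 1, 0)
    return [f"{page}:{m}" for m in range(start, now_minute + 1)]

def unique_visitors(r, page: str, now_minute: int) -> int:
    """PFCOUNT over multiple keys returns the cardinality of their union."""
    return r.pfcount(*window_keys(page, now_minute))

def track_with_ttl(r, page: str, visitor_id: str, minute: int) -> None:
    """PFADD, then EXPIRE so stale minute keys evict themselves.
    A minute-M key can appear in queries up to minute M+59, so a
    61-minute TTL (one minute of slack) outlives every query using it."""
    key = f"{page}:{minute}"
    r.pfadd(key, visitor_id)
    r.expire(key, 61 * 60)
```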
You can’t get time intervals in a single HyperLogLog key.

A sorted set could be an option: add each visitor with `ZADD`, using the visit timestamp as the score, e.g. `ZADD page1 1 user1` (small numbers are used as timestamps for the example), then call `ZCOUNT` to get the total number of unique users in that time interval. When you use `ZCOUNT`, you define `MIN` as (current time – (60*60)) and `MAX` as `+inf`, so it covers everything between (now – 3600 seconds) and now. One drawback of this approach is that you need to remove old data from these sets manually, using `ZREMRANGEBYSCORE`.