I’m building an analytics solution with a worker that calculates 30-day statistics and a UI that displays them. Currently, the UI uses a fixed "last 30 days" date range, but I want to support custom date ranges with millisecond response times. Additive metrics like video_views can be pre-calculated daily and summed for any date range.
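For instance, a minimal sketch of that daily pre-aggregation (the table and column names here are illustrative, not from my actual system):

```sql
-- hypothetical rollup table: one row per day, filled by the worker
CREATE TABLE daily_video_views (
    day   date PRIMARY KEY,
    views bigint NOT NULL
);

-- any custom date range then reduces to summing a handful of daily rows
SELECT sum(views) AS video_views
FROM daily_video_views
WHERE day BETWEEN '2024-01-01' AND '2024-01-31';
```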
However, non-additive metrics, such as unique_videos and unique_visitors, require a different approach, since they need to account for unique values. How can I handle non-additive metrics efficiently?
Notes:
- 20 million daily active users
- 50 million daily events
- Current solution based on AWS (ECS, Redshift, RDS)
- Raw data is clickstream
2 Answers
I don’t have experience designing systems with that level of activity, so I’ll be interested to see what other ideas are put forward.
To get the unique-value counts, some options:
We have used the following technique to tell how many unique sessions fell within any arbitrary time window. Each session has a start and an end, or more precisely a first-seen and a last-seen attribute. When we saw a new session id, we created a new record in the database with the first-seen timestamp. Then, every minute, we updated the active sessions’ last-seen attribute with the current timestamp.
We used an Aurora PostgreSQL database because of its time-range overlap support.
Here is a simplified version of the table schema:
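A minimal version consistent with that description might look like this (one row per session id, with the first-seen/last-seen pair stored as a single range value; the column names are my own):

```sql
CREATE TABLE sessions (
    session_id    text PRIMARY KEY,
    -- [first_seen, last_seen] held as one tsrange value, so the
    -- overlap operator (&&) can be applied directly in queries
    active_period tsrange NOT NULL
);
```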
The only thing that is worth calling out is the usage of the tsrange type. Here is a simplified upsert stored procedure:
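A sketch of such an upsert, assuming the sessions table above and written as a plain plpgsql function (the actual procedure may differ):

```sql
CREATE OR REPLACE FUNCTION upsert_session(p_session_id text, p_seen timestamp)
RETURNS void AS $$
BEGIN
    -- first sighting: store a single-point range [p_seen, p_seen]
    INSERT INTO sessions (session_id, active_period)
    VALUES (p_session_id, tsrange(p_seen, p_seen, '[]'))
    ON CONFLICT (session_id) DO UPDATE
        -- later sightings: keep the original first-seen bound,
        -- move the last-seen bound forward
        SET active_period = tsrange(lower(sessions.active_period), p_seen, '[]');
END;
$$ LANGUAGE plpgsql;
```

The per-minute job described above would call this once per active session id with the current timestamp.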
And of course, you have to create a GiST index to make the overlap operator efficient:
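Something along these lines, since GiST indexes support the range overlap operator (&&):

```sql
CREATE INDEX sessions_active_period_idx
    ON sessions USING gist (active_period);
```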
And finally, the query:
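Against the sketched schema it could look like this (the window bounds are illustrative):

```sql
-- one row per session_id, so count(*) is the unique-session count
SELECT count(*) AS unique_sessions
FROM sessions
WHERE active_period && tsrange('2023-01-01', '2023-02-01', '[)');
```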
You might need to consider partitioning your table (e.g., by day) to keep the table and the query scalable.
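For reference, a sketch of PostgreSQL declarative range partitioning along those lines; note that on a partitioned table the primary key must include the partition key, so the schema above would need adjusting:

```sql
-- hypothetical partitioned variant, keyed by the day the session was first seen
CREATE TABLE sessions_by_day (
    session_id     text NOT NULL,
    first_seen_day date NOT NULL,
    active_period  tsrange NOT NULL,
    PRIMARY KEY (session_id, first_seen_day)
) PARTITION BY RANGE (first_seen_day);

-- one partition per day, created ahead of time
CREATE TABLE sessions_2023_01_01 PARTITION OF sessions_by_day
    FOR VALUES FROM ('2023-01-01') TO ('2023-01-02');
```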