I’m building an analytics solution with a worker that calculates 30-day statistics and a UI that displays them. Currently, the UI uses a fixed "last 30 days" date range, but I want to support custom date ranges with millisecond response times. Additive metrics like video_views can be pre-calculated daily and summed for any date range.
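For instance, a minimal sketch of that daily pre-aggregation (the table and column names here are illustrative, not from my actual system):

```sql
-- hypothetical rollup table: one row per day, filled by the worker
CREATE TABLE daily_video_views (
    day   date PRIMARY KEY,
    views bigint NOT NULL
);

-- any custom date range then reduces to summing a handful of daily rows
SELECT sum(views) AS video_views
FROM daily_video_views
WHERE day BETWEEN '2024-01-01' AND '2024-01-31';
```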
However, non-additive metrics, such as unique_videos and unique_visitors, require a different approach, since they need to account for unique values. How can I handle non-additive metrics efficiently?
Notes:
- 20 million daily active users
- 50 million daily events
- Current solution based on AWS (ECS, Redshift, RDS)
- Raw data is clickstream
2 Answers
I don’t have experience designing systems with that level of activity, so I’ll be interested to see what other ideas are put forward.
To get the unique-value counts, some options:
We have used the following technique to tell how many unique sessions fell within any arbitrary time window. Each session has a start and an end, or more precisely a first-seen and a last-seen attribute. When we saw a new session id, we created a new record in the database with the first-seen timestamp. Then, every minute, we updated the active sessions’ last-seen attribute with the current timestamp.
We used an Aurora PostgreSQL database because of its time-range overlap support.
Here is a simplified version of the table schema:
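A minimal version consistent with that description might look like this (one row per session id, with the first-seen/last-seen pair stored as a single range value; the column names are my own):

```sql
CREATE TABLE sessions (
    session_id    text PRIMARY KEY,
    -- [first_seen, last_seen] held as one tsrange value, so the
    -- overlap operator (&&) can be applied directly in queries
    active_period tsrange NOT NULL
);
```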
The only thing that is worth calling out is the usage of the tsrange type. Here is a simplified upsert stored procedure:
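A sketch of such an upsert, assuming the sessions table above and written as a plain plpgsql function (the actual procedure may differ):

```sql
CREATE OR REPLACE FUNCTION upsert_session(p_session_id text, p_seen timestamp)
RETURNS void AS $$
BEGIN
    -- first sighting: store a single-point range [p_seen, p_seen]
    INSERT INTO sessions (session_id, active_period)
    VALUES (p_session_id, tsrange(p_seen, p_seen, '[]'))
    ON CONFLICT (session_id) DO UPDATE
        -- later sightings: keep the original first-seen bound,
        -- move the last-seen bound forward
        SET active_period = tsrange(lower(sessions.active_period), p_seen, '[]');
END;
$$ LANGUAGE plpgsql;
```

The per-minute job described above would call this once per active session id with the current timestamp.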
And of course, you have to create a GiST index to make the overlap operator efficient:
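Something along these lines, since GiST indexes support the range overlap operator (&&):

```sql
CREATE INDEX sessions_active_period_idx
    ON sessions USING gist (active_period);
```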
And finally, the query:
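Against the sketched schema it could look like this (the window bounds are illustrative):

```sql
-- one row per session_id, so count(*) is the unique-session count
SELECT count(*) AS unique_sessions
FROM sessions
WHERE active_period && tsrange('2023-01-01', '2023-02-01', '[)');
```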
You might need to consider partitioning your table (e.g., by day) to keep the table and the query scalable.
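For reference, a sketch of PostgreSQL declarative range partitioning along those lines; note that on a partitioned table the primary key must include the partition key, so the schema above would need adjusting:

```sql
-- hypothetical partitioned variant, keyed by the day the session was first seen
CREATE TABLE sessions_by_day (
    session_id     text NOT NULL,
    first_seen_day date NOT NULL,
    active_period  tsrange NOT NULL,
    PRIMARY KEY (session_id, first_seen_day)
) PARTITION BY RANGE (first_seen_day);

-- one partition per day, created ahead of time
CREATE TABLE sessions_2023_01_01 PARTITION OF sessions_by_day
    FOR VALUES FROM ('2023-01-01') TO ('2023-01-02');
```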