
I’m building an analytical solution with a worker to calculate 30-day statistics and a UI to display them.

Currently, the UI uses a fixed last-30-days date range, but I want to support custom date ranges while keeping millisecond response times.

Additive metrics like video_views can be pre-calculated daily and summed for any date range.
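
For the additive case, here is a minimal sketch of what I mean (the daily_video_views table and its columns are hypothetical names):

    -- One row per day, filled by the worker; any custom range is a cheap SUM.
    CREATE TABLE IF NOT EXISTS daily_video_views(
        day DATE PRIMARY KEY,
        video_views BIGINT NOT NULL
    );

    -- A custom date range touches at most a few hundred rows.
    SELECT sum(video_views)
    FROM daily_video_views
    WHERE day BETWEEN '2023-09-01' AND '2023-09-30';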

However, non-additive metrics, such as unique_videos and unique_visitors, require a different approach: a visitor who is active on two different days counts only once over a range covering both days, so daily unique counts cannot simply be summed.

How can I handle non-additive metrics efficiently?

Notes:

  • 20 million daily active users
  • 50 million daily events
  • Current solution based on AWS (ECS, Redshift, RDS)
  • Raw data is clickstream

2 Answers


  1. I don’t have experience designing systems with that level of activity, so I’ll be interested to see what other ideas are put forward.

    • Gather data: when a given event occurs (new session, or new video added) capture the essential info and throw it in a queue to be processed. Could be one queue or one queue per event-type, whatever you think will be best.
    • Data structure / solution: use a NoSQL database of some kind, like DynamoDB. Have a "table" per event of interest.
    • Processing: Process items in the queue(s), adding 1 new entry/record per event.
    • Processing option: employ a cache of some kind – update it at the same time you process each result, using a caching pattern like cache-aside or write-behind.

    To get the value-counts, some options:

    • Just query as needed, from the database.
    • Query from the cache.
    • Pre-compile value-counts periodically (e.g. every 5 mins) – assuming that is functionally acceptable (a sketch follows below).
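
    If the pre-compiled counts live in a relational store rather than DynamoDB, that last option could look like this in PostgreSQL (shown in SQL for consistency with answer 2; the events table and its columns are made-up names):

    -- Periodically pre-compiled value-counts over a hypothetical "events" table.
    CREATE MATERIALIZED VIEW value_counts AS
    SELECT event_type, count(*) AS event_count
    FROM events
    GROUP BY event_type;

    -- A unique index allows refreshing without blocking readers.
    CREATE UNIQUE INDEX value_counts_idx ON value_counts(event_type);

    -- Run from a scheduler (cron, pg_cron, ...) every 5 minutes:
    REFRESH MATERIALIZED VIEW CONCURRENTLY value_counts;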
  2. We used the following technique to tell how many unique sessions fell within any arbitrary time window. Each session has a start and an end, or more precisely a first-seen and a last-seen attribute. When we saw a new session id, we created a new record in the database with the first-seen timestamp. Every minute, we updated the active sessions’ last-seen attribute with the current timestamp.

    We used an Aurora PostgreSQL database because of its support for time-range overlap queries.

    Here is a simplified version of the table schema:

    CREATE TABLE IF NOT EXISTS unique_sessions(
        session_id CHAR(36) NOT NULL, -- UUID with hyphens
        duration tsrange,
        -- rest of the dimensions
        PRIMARY KEY (session_id) -- needed for the ON CONFLICT clause below
    );
    

    The only thing worth calling out is the use of the tsrange type, which stores the first-seen/last-seen pair as a single time range.
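
    To make the range semantics concrete, a quick illustration of tsrange and the overlap operator:

    -- Two ranges overlap (&&) when they share at least one instant.
    SELECT tsrange('2023-10-01 10:00', '2023-10-01 10:05')
        && tsrange('2023-10-01 10:04', '2023-10-01 10:10'); -- true

    -- Default bounds are '[)', so ranges that merely touch do not overlap:
    SELECT tsrange('2023-10-01 10:00', '2023-10-01 10:05')
        && tsrange('2023-10-01 10:05', '2023-10-01 10:10'); -- false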

    Here is a simplified upsert stored procedure:

    CREATE OR REPLACE PROCEDURE unique_session_upsert(
        sessionId CHAR(36),
        lastSeen TIMESTAMP(0) without time zone
        ... -- dimensions
    ) LANGUAGE plpgsql
        AS $BODY$
        BEGIN
            INSERT INTO unique_sessions(session_id, duration, ...)
            -- date_trunc fills a day dimension, useful for day-based partitioning
            VALUES (sessionId, tsrange(lastSeen - interval '1 minute', lastSeen), date_trunc('day', lastSeen), ...)
            ON CONFLICT(session_id) DO UPDATE
                SET duration = tsrange(least(lower(unique_sessions.duration), lastSeen),
                                       greatest(upper(unique_sessions.duration), lastSeen));
        END; $BODY$;
    

    And of course you have to create a GiST index to make the overlap operator (&&) efficient:

    CREATE INDEX unique_sessions_idx ON unique_sessions USING GIST (duration);
    

    And finally, the query:

    SELECT count(1) FROM unique_sessions
    WHERE duration && tsrange(query_start, query_end, '[]')
    -- AND rest of the dimension filters 
    

    You might also need to partition the table (e.g. by day) to keep both the table and the query scalable; a sketch of that follows.
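
    For reference, a sketch of a day-partitioned variant, assuming a day dimension column exists (note that a unique constraint on a partitioned table must include the partition key, so the upsert's conflict target would become (session_id, day)):

    -- Hypothetical day-partitioned version of the table above.
    CREATE TABLE IF NOT EXISTS unique_sessions(
        session_id CHAR(36) NOT NULL,
        duration tsrange NOT NULL,
        day DATE NOT NULL,
        -- rest of the dimensions
        PRIMARY KEY (session_id, day) -- must include the partition key
    ) PARTITION BY RANGE (day);

    -- One partition per day, created ahead of time by a scheduled job:
    CREATE TABLE unique_sessions_2023_10_01 PARTITION OF unique_sessions
        FOR VALUES FROM ('2023-10-01') TO ('2023-10-02');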
