skip to Main Content

I have a Postgres database. It has two tables that we’re concerned with right now. Users has a Name field and an ID field, and Events has a UserID field (FK to Users.ID), EventType field (typed as integer), and Time field (typed as datetime).

I want to find, for each user, the longest streak of consecutive days in which they have generated at least one event of type 3. The answer should return one row for each user, with the user’s name, the date on which the longest streak began, and the number of days in the streak. Joining the tables to get the name is trivial, but I can’t figure out any good way to determine the streaks. It feels like it should be an aggregate, but there aren’t any applicable aggregates that I can see.

2

Answers


  1. The solution for this is based on the following steps

    Step 1: Order the data in asc order for each user and assign row_number

    Step 2: calculate the estimated start_date by substracting row_number from the date

    Step 3: group by estimated start_date and the user to get the consicutive number of days for which the user has loggedIn

    with base_data as (
     select u.ID,u.Name,e.event_date from Users u
     join 
     (select UserID, cast(Time as date) as event_date from Events 
       where EventType=3  ) e
     on u.ID = e.UserID
    ),
    ordered_data as ( 
      select 
        ID,Name,event_date,
        row_number() over(partition by ID,order by event_date ) as row_number
      from base_data
    ),
    grouped_date_data as (
      select 
         ID,
         Name,
         DATE_SUB(event_date, INTERVAL row_num DAY) AS group_date
      from ordered_data),
    
    select ID,Name,max(group_date),count(*) as consicutive_days
    from grouped_date_data
    group by ID,Name
    
    
    Login or Signup to reply.
  2. The general approach for "gaps and islands" problems is to identify the beginning of each island and then attach a group identification tag to each row that is part of the island. These tags facilitate additional operations on the data in each island.

    The following creates the demonstration tables:

    CREATE TABLE users (
      id BIGINT PRIMARY KEY,
      name TEXT NOT NULL
    );
    
    CREATE TABLE events (
      id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      user_id INTEGER NOT NULL REFERENCES users(id),
      event_ts TIMESTAMPTZ NOT NULL,
      event_type INTEGER
    );
    

    These tables are populated with random events with the following block (the parameters were chosen to give a reasonable probability that at least one user would have more than one longest run with a data set that is small enough to be manually explored):

    DO LANGUAGE plpgsql
    $BLOCK$
      <<local>>
      DECLARE
        num_event_types    CONSTANT INTEGER := 3;
        num_events         CONSTANT INTEGER := 120;
        num_events_per_day CONSTANT INTEGER := 12;
        num_users          CONSTANT INTEGER := 3;
        event_interval     CONSTANT INTERVAL := INTERVAL '1' DAY / num_events_per_day;
      BEGIN
        INSERT INTO users (id, name)
        SELECT gs.n, 'user_' || gs.n
          FROM GENERATE_SERIES(1, local.num_users) gs (n);
    
        INSERT INTO events(user_id, event_ts, event_type)
        SELECT FLOOR(RANDOM() * local.num_users)::INTEGER + 1 AS user_id,
               '2024-01-01'::TIMESTAMP + event_interval * n AS event_ts,
               FLOOR(RANDOM() * local.num_event_types)::INTEGER + 1 AS event_type
          FROM GENERATE_SERIES(1, local.num_events) gs (n);
      END;
    $BLOCK$;
    

    The following query demonstrates one approach to achieve the desired results (I’ve added a column to report the number of selected events even though it’s not required):

    WITH
      user_event_dates AS (
        SELECT user_id, event_ts::date AS event_date, COUNT(*) AS num_events
          FROM events
          WHERE event_type = 3
          GROUP BY user_id, event_date
      ),
      group_increments AS (
        SELECT user_id,
               event_date,
               num_events,
               CASE WHEN (LAG(event_date)
                          OVER (PARTITION BY user_id ORDER BY event_date) < event_date - 1) IS NOT FALSE THEN 1
                    ELSE 0 END AS group_increment
          FROM user_event_dates
      ),
      user_event_groups AS (
        SELECT user_id, event_date, num_events, SUM(group_increment) OVER (ORDER BY user_id, event_date) AS group_id
          FROM group_increments)
    SELECT DISTINCT ON (user_event_groups.user_id) user_event_groups.user_id,
           users.name,
           MIN(user_event_groups.event_date) AS start_date,
           COUNT(*) AS num_days,
           SUM(user_event_groups.num_events) AS num_events
      FROM user_event_groups
      JOIN users
        ON users.id = user_event_groups.user_id
      GROUP BY user_event_groups.user_id, users.name, user_event_groups.group_id
      ORDER BY user_event_groups.user_id, num_days DESC, start_date;
    

    The first CTE (user_event_dates) aggregates the qualifying events by day for each user. The next CTE (group_increments) sets an increment value for each user day that starts a new contiguous series. The third CTE (user_event_groups) tags each user event day with a group number. The main query then selects the earliest longest run for each user. Changing the order on start_dateto be DESC will cause the query to return the most recent longest run. Similarly, changing the order to ASC on num_days, will return a shortest run for each user.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search