Postgresql - Is there any good way to determine streaks of consecutive events in SQL?

MasonWheeler
July 10, 2024
84 views
0 votes
2 Answers

I have a Postgres database. It has two tables that we’re concerned with right now. Users has a Name field and an ID field, and Events has a UserID field (FK to Users.ID), EventType field (typed as integer), and Time field (typed as datetime).

I want to find, for each user, the longest streak of consecutive days in which they have generated at least one event of type 3. The answer should return one row for each user, with the user’s name, the date on which the longest streak began, and the number of days in the streak. Joining the tables to get the name is trivial, but I can’t figure out any good way to determine the streaks. It feels like it should be an aggregate, but there aren’t any applicable aggregates that I can see.

Answers

- YESHWANTHLN
- July 9, 2024 at 1:12 pm
- 0 votes
0
The solution for this is based on the following steps

Step 1: Order the data in asc order for each user and assign row_number

Step 2: calculate the estimated start_date by substracting row_number from the date

Step 3: group by estimated start_date and the user to get the consicutive number of days for which the user has loggedIn
```
with base_data as (
 select u.ID,u.Name,e.event_date from Users u
 join 
 (select UserID, cast(Time as date) as event_date from Events 
   where EventType=3  ) e
 on u.ID = e.UserID
),
ordered_data as ( 
  select 
    ID,Name,event_date,
    row_number() over(partition by ID,order by event_date ) as row_number
  from base_data
),
grouped_date_data as (
  select 
     ID,
     Name,
     DATE_SUB(event_date, INTERVAL row_num DAY) AS group_date
  from ordered_data),

select ID,Name,max(group_date),count(*) as consicutive_days
from grouped_date_data
group by ID,Name
```
Login or Signup to reply.

The general approach for "gaps and islands" problems is to identify the beginning of each island and then attach a group identification tag to each row that is part of the island. These tags facilitate additional operations on the data in each island.

The following creates the demonstration tables:

CREATE TABLE users (
  id BIGINT PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE events (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  event_ts TIMESTAMPTZ NOT NULL,
  event_type INTEGER
);

These tables are populated with random events with the following block (the parameters were chosen to give a reasonable probability that at least one user would have more than one longest run with a data set that is small enough to be manually explored):

DO LANGUAGE plpgsql
$BLOCK$
  <<local>>
  DECLARE
    num_event_types    CONSTANT INTEGER := 3;
    num_events         CONSTANT INTEGER := 120;
    num_events_per_day CONSTANT INTEGER := 12;
    num_users          CONSTANT INTEGER := 3;
    event_interval     CONSTANT INTERVAL := INTERVAL '1' DAY / num_events_per_day;
  BEGIN
    INSERT INTO users (id, name)
    SELECT gs.n, 'user_' || gs.n
      FROM GENERATE_SERIES(1, local.num_users) gs (n);

    INSERT INTO events(user_id, event_ts, event_type)
    SELECT FLOOR(RANDOM() * local.num_users)::INTEGER + 1 AS user_id,
           '2024-01-01'::TIMESTAMP + event_interval * n AS event_ts,
           FLOOR(RANDOM() * local.num_event_types)::INTEGER + 1 AS event_type
      FROM GENERATE_SERIES(1, local.num_events) gs (n);
  END;
$BLOCK$;

The following query demonstrates one approach to achieve the desired results (I’ve added a column to report the number of selected events even though it’s not required):

WITH
  user_event_dates AS (
    SELECT user_id, event_ts::date AS event_date, COUNT(*) AS num_events
      FROM events
      WHERE event_type = 3
      GROUP BY user_id, event_date
  ),
  group_increments AS (
    SELECT user_id,
           event_date,
           num_events,
           CASE WHEN (LAG(event_date)
                      OVER (PARTITION BY user_id ORDER BY event_date) < event_date - 1) IS NOT FALSE THEN 1
                ELSE 0 END AS group_increment
      FROM user_event_dates
  ),
  user_event_groups AS (
    SELECT user_id, event_date, num_events, SUM(group_increment) OVER (ORDER BY user_id, event_date) AS group_id
      FROM group_increments)
SELECT DISTINCT ON (user_event_groups.user_id) user_event_groups.user_id,
       users.name,
       MIN(user_event_groups.event_date) AS start_date,
       COUNT(*) AS num_days,
       SUM(user_event_groups.num_events) AS num_events
  FROM user_event_groups
  JOIN users
    ON users.id = user_event_groups.user_id
  GROUP BY user_event_groups.user_id, users.name, user_event_groups.group_id
  ORDER BY user_event_groups.user_id, num_days DESC, start_date;

The first CTE (user_event_dates) aggregates the qualifying events by day for each user. The next CTE (group_increments) sets an increment value for each user day that starts a new contiguous series. The third CTE (user_event_groups) tags each user event day with a group number. The main query then selects the earliest longest run for each user. Changing the order on start_dateto be DESC will cause the query to return the most recent longest run. Similarly, changing the order to ASC on num_days, will return a shortest run for each user.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Is there any good way to determine streaks of consecutive events in SQL?

Answers