skip to Main Content

I have very simple table:

create table tmp(
    id int,
    source_id int,
    event_user varchar,
    event_date timestamp
);
 

and here is the data:

insert into tmp(id, source_id, event_user, event_date)
values (1, 1,  'A', now()),
       (2, 1,  'A', now()+interval '1 day'),
       (3, 1,  'B', now()+interval '2 day'),
       (4, 1,  'B', now()+interval '3 day'),
       (5, 1,  'A', now()+interval '4 day'),
       (6, 1,  'A', now()+interval '5 day'),
       (7, 1,  'A', now()+interval '6 day'),
       (8, 2,  'A', now()+interval '7 day'),
       (9, 2,  'B', now()+interval '8 day'),
       (10, 2,  'A', now()+interval '9 day'),
       (11, 2,  'B', now()+interval '10 day'),
       (12, 2,  'B', now()+interval '11 day') ;

(Here for the sake of simplicity I just added 1 day to ‘now’ timestamp in column event_date to show time sequence. In real case no any assumption about distribution of event_date could be made. Also columns ‘id’ and ‘source_id’ in real case are of type uuid.)

When I query this table:

select * from tmp
order by event_date;

I get:
initial data

id source_id event_user event_date
1 1 A 03-05-2024
2 1 A 04-05-2024
3 1 B 05-05-2024
4 1 B 06-05-2024
5 1 A 07-05-2024
6 1 A 08-05-2024
7 1 A 09-05-2024
8 2 A 10-05-2024
9 2 B 11-05-2024
10 2 A 12-05-2024
11 2 B 13-05-2024
12 2 B 14-05-2024

Now, and here is the question, I need additional column (series_id) which identifies the groups of sequential (in order of event_date) records belonging to the same source_id and event_user.

so when I’ll query I would like to get the following result:

expected result

id source_id SERIES_ID event_user event_date
1 1 1 A 03-05-2024
2 1 1 A 04-05-2024
3 1 2 B 05-05-2024
4 1 2 B 06-05-2024
5 1 3 A 07-05-2024
6 1 3 A 08-05-2024
7 1 3 A 09-05-2024
8 2 1 A 10-05-2024
9 2 2 B 11-05-2024
10 2 3 A 12-05-2024
11 2 4 B 13-05-2024
12 2 4 B 14-05-2024

I’ve tried window functions (rank(), dense_rank()), but didn’t succeed, sure because records in expected groups do not have no common features except they’re ‘consequent’ in time. May be it’s a kind of ‘gaps and islands’ problem, but unfortunately I cant catch how to solve it.

2

Answers


  1. Your suspicion is correct, this is a gaps and islands problem. Below is one approach to get the described results:

    WITH
      tmp (id, source_id, event_user, event_date) AS (
        VALUES
          (1, 1, 'A', NOW()),
          (2, 1, 'A', NOW() + INTERVAL '1 day'),
          (3, 1, 'B', NOW() + INTERVAL '2 day'),
          (4, 1, 'B', NOW() + INTERVAL '3 day'),
          (5, 1, 'A', NOW() + INTERVAL '4 day'),
          (6, 1, 'A', NOW() + INTERVAL '5 day'),
          (7, 1, 'A', NOW() + INTERVAL '6 day'),
          (8, 2, 'A', NOW() + INTERVAL '7 day'),
          (9, 2, 'B', NOW() + INTERVAL '8 day'),
          (10, 2, 'A', NOW() + INTERVAL '9 day'),
          (11, 2, 'B', NOW() + INTERVAL '10 day'),
          (12, 2, 'B', NOW() + INTERVAL '11 day')
      ),
      breaks AS (
        SELECT
          id,
          event_user IS DISTINCT FROM LAG(event_user) OVER (
            PARTITION BY
              source_id
            ORDER BY
              event_date,
              event_user
          ) break,
          source_id,
          event_user,
          event_date
        FROM
          tmp
      )
    SELECT
      id,
      source_id,
      COUNT(break) FILTER (
        WHERE
          break
      ) OVER (
        PARTITION BY
          source_id
        ORDER BY
          event_date
      ) AS series_id,
      event_user,
      event_date
    FROM
      breaks
    ORDER BY
      event_date;
    

    This approach determines which rows start a new series and then assigns the series by counting the number of series starts within each source_id.

    Login or Signup to reply.
  2. The problem can be solved with slightly more concise query than JohnH‘s one.

    select
      id,
      source_id,
      sum(break) over(partition by source_id
                      order by event_date, id) + 1 as serie_id,
      event_user,
      event_date
    from (select
            *,
            (coalesce(lag(event_user) over(partition by source_id
                                           order by event_date, id),
                      event_user) <> event_user)::int as break
          from
            tmp)
    ;
    

    db<>fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search