Postgresql - Identify groups of sequential records

badbeetleredbeetle
May 3, 2024
174 views
0 votes
2 Answers

I have very simple table:

create table tmp(
    id int,
    source_id int,
    event_user varchar,
    event_date timestamp
);

and here is the data:

insert into tmp(id, source_id, event_user, event_date)
values (1, 1,  'A', now()),
       (2, 1,  'A', now()+interval '1 day'),
       (3, 1,  'B', now()+interval '2 day'),
       (4, 1,  'B', now()+interval '3 day'),
       (5, 1,  'A', now()+interval '4 day'),
       (6, 1,  'A', now()+interval '5 day'),
       (7, 1,  'A', now()+interval '6 day'),
       (8, 2,  'A', now()+interval '7 day'),
       (9, 2,  'B', now()+interval '8 day'),
       (10, 2,  'A', now()+interval '9 day'),
       (11, 2,  'B', now()+interval '10 day'),
       (12, 2,  'B', now()+interval '11 day') ;

(Here for the sake of simplicity I just added 1 day to ‘now’ timestamp in column event_date to show time sequence. In real case no any assumption about distribution of event_date could be made. Also columns ‘id’ and ‘source_id’ in real case are of type uuid.)

When I query this table:

select * from tmp
order by event_date;

I get:
initial data

id	source_id	event_user	event_date
1	1	A	03-05-2024
2	1	A	04-05-2024
3	1	B	05-05-2024
4	1	B	06-05-2024
5	1	A	07-05-2024
6	1	A	08-05-2024
7	1	A	09-05-2024
8	2	A	10-05-2024
9	2	B	11-05-2024
10	2	A	12-05-2024
11	2	B	13-05-2024
12	2	B	14-05-2024

Now, and here is the question, I need additional column (series_id) which identifies the groups of sequential (in order of event_date) records belonging to the same source_id and event_user.

so when I’ll query I would like to get the following result:

expected result

id	source_id	SERIES_ID	event_user	event_date
1	1	1	A	03-05-2024
2	1	1	A	04-05-2024
3	1	2	B	05-05-2024
4	1	2	B	06-05-2024
5	1	3	A	07-05-2024
6	1	3	A	08-05-2024
7	1	3	A	09-05-2024
8	2	1	A	10-05-2024
9	2	2	B	11-05-2024
10	2	3	A	12-05-2024
11	2	4	B	13-05-2024
12	2	4	B	14-05-2024

I’ve tried window functions (rank(), dense_rank()), but didn’t succeed, sure because records in expected groups do not have no common features except they’re ‘consequent’ in time. May be it’s a kind of ‘gaps and islands’ problem, but unfortunately I cant catch how to solve it.

Answers

Your suspicion is correct, this is a gaps and islands problem. Below is one approach to get the described results:

WITH
  tmp (id, source_id, event_user, event_date) AS (
    VALUES
      (1, 1, 'A', NOW()),
      (2, 1, 'A', NOW() + INTERVAL '1 day'),
      (3, 1, 'B', NOW() + INTERVAL '2 day'),
      (4, 1, 'B', NOW() + INTERVAL '3 day'),
      (5, 1, 'A', NOW() + INTERVAL '4 day'),
      (6, 1, 'A', NOW() + INTERVAL '5 day'),
      (7, 1, 'A', NOW() + INTERVAL '6 day'),
      (8, 2, 'A', NOW() + INTERVAL '7 day'),
      (9, 2, 'B', NOW() + INTERVAL '8 day'),
      (10, 2, 'A', NOW() + INTERVAL '9 day'),
      (11, 2, 'B', NOW() + INTERVAL '10 day'),
      (12, 2, 'B', NOW() + INTERVAL '11 day')
  ),
  breaks AS (
    SELECT
      id,
      event_user IS DISTINCT FROM LAG(event_user) OVER (
        PARTITION BY
          source_id
        ORDER BY
          event_date,
          event_user
      ) break,
      source_id,
      event_user,
      event_date
    FROM
      tmp
  )
SELECT
  id,
  source_id,
  COUNT(break) FILTER (
    WHERE
      break
  ) OVER (
    PARTITION BY
      source_id
    ORDER BY
      event_date
  ) AS series_id,
  event_user,
  event_date
FROM
  breaks
ORDER BY
  event_date;

This approach determines which rows start a new series and then assigns the series by counting the number of series starts within each source_id.

The problem can be solved with slightly more concise query than JohnH‘s one.

select
  id,
  source_id,
  sum(break) over(partition by source_id
                  order by event_date, id) + 1 as serie_id,
  event_user,
  event_date
from (select
        *,
        (coalesce(lag(event_user) over(partition by source_id
                                       order by event_date, id),
                  event_user) <> event_user)::int as break
      from
        tmp)
;

db<>fiddle

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Identify groups of sequential records

Answers