I have very simple table:
create table tmp(
id int,
source_id int,
event_user varchar,
event_date timestamp
);
and here is the data:
insert into tmp(id, source_id, event_user, event_date)
values (1, 1, 'A', now()),
(2, 1, 'A', now()+interval '1 day'),
(3, 1, 'B', now()+interval '2 day'),
(4, 1, 'B', now()+interval '3 day'),
(5, 1, 'A', now()+interval '4 day'),
(6, 1, 'A', now()+interval '5 day'),
(7, 1, 'A', now()+interval '6 day'),
(8, 2, 'A', now()+interval '7 day'),
(9, 2, 'B', now()+interval '8 day'),
(10, 2, 'A', now()+interval '9 day'),
(11, 2, 'B', now()+interval '10 day'),
(12, 2, 'B', now()+interval '11 day') ;
(Here for the sake of simplicity I just added 1 day to ‘now’ timestamp in column event_date to show time sequence. In real case no any assumption about distribution of event_date could be made. Also columns ‘id’ and ‘source_id’ in real case are of type uuid.)
When I query this table:
select * from tmp
order by event_date;
I get:
initial data
id | source_id | event_user | event_date |
---|---|---|---|
1 | 1 | A | 03-05-2024 |
2 | 1 | A | 04-05-2024 |
3 | 1 | B | 05-05-2024 |
4 | 1 | B | 06-05-2024 |
5 | 1 | A | 07-05-2024 |
6 | 1 | A | 08-05-2024 |
7 | 1 | A | 09-05-2024 |
8 | 2 | A | 10-05-2024 |
9 | 2 | B | 11-05-2024 |
10 | 2 | A | 12-05-2024 |
11 | 2 | B | 13-05-2024 |
12 | 2 | B | 14-05-2024 |
Now, and here is the question, I need additional column (series_id) which identifies the groups of sequential (in order of event_date) records belonging to the same source_id and event_user.
so when I’ll query I would like to get the following result:
id | source_id | SERIES_ID | event_user | event_date |
---|---|---|---|---|
1 | 1 | 1 | A | 03-05-2024 |
2 | 1 | 1 | A | 04-05-2024 |
3 | 1 | 2 | B | 05-05-2024 |
4 | 1 | 2 | B | 06-05-2024 |
5 | 1 | 3 | A | 07-05-2024 |
6 | 1 | 3 | A | 08-05-2024 |
7 | 1 | 3 | A | 09-05-2024 |
8 | 2 | 1 | A | 10-05-2024 |
9 | 2 | 2 | B | 11-05-2024 |
10 | 2 | 3 | A | 12-05-2024 |
11 | 2 | 4 | B | 13-05-2024 |
12 | 2 | 4 | B | 14-05-2024 |
I’ve tried window functions (rank(), dense_rank()), but didn’t succeed, sure because records in expected groups do not have no common features except they’re ‘consequent’ in time. May be it’s a kind of ‘gaps and islands’ problem, but unfortunately I cant catch how to solve it.
2
Answers
Your suspicion is correct, this is a gaps and islands problem. Below is one approach to get the described results:
This approach determines which rows start a new series and then assigns the series by counting the number of series starts within each
source_id
.The problem can be solved with slightly more concise query than JohnH‘s one.
db<>fiddle