I am looking to group users into separate groups based on when they bought something.
A new group starts after it’s been more than 90 days when the user last bought an item.
Example of what I am looking for:
row | user_id | date | min_purchase_date_in_group | group |
---|---|---|---|---|
1 | 123 | 2023-04-01 | 2023-04-01 | 1 |
2 | 123 | 2023-04-01 | 2023-04-01 | 1 |
3 | 123 | 2023-04-23 | 2023-04-01 | 1 |
4 | 123 | 2023-05-07 | 2023-04-01 | 1 |
5 | 123 | 2023-06-04 | 2023-04-01 | 1 |
6 | 123 | 2023-06-29 | 2023-04-01 | 1 |
7 | 123 | 2023-07-09 | 2023-07-09 | 2 |
8 | 123 | 2023-07-16 | 2023-07-09 | 2 |
9 | 123 | 2023-07-16 | 2023-07-09 | 2 |
10 | 123 | 2023-08-25 | 2023-07-09 | 2 |
11 | 123 | 2023-09-04 | 2023-07-09 | 2 |
12 | 123 | 2023-10-11 | 2023-10-11 | 3 |
13 | 123 | 2023-10-16 | 2023-10-11 | 3 |
14 | 123 | 2023-12-16 | 2023-10-11 | 3 |
Group 1 all share the same min_purchase_date_in_group since all the purchase dates are within 90 days of the first purchase.
Row 7 is the start of a new group because it is more than 90 days from the first purchase date (2023-04-01) and all the dates within group 2 are 90 days within the first purchase date (2023-07-09) for the group.
Row 12 starts a new group because the first purchase date (2023-10-11) for this group is more than 90 days from the first visit for group 2 (2023-07-09).
I was able to perform this in Python, but I am looking to solve this using SQL.
I am somewhat able to solve this if the dates are further apart by checking the previous row value and determining if it’s been more than 90 days.
If the dates are closer, like my example, my query assigns the same group to all 14 rows.
Here is what I currently have in order to solve for min_purchase_date_in_group:
select
user_id,
date,
case
when prev_date is null then date
when prev_date is not null and datediff(day, first_value(date) over(partition by user_id order by date rows between unbounded preceding and unbounded following), date) < 90 then first_value(date) over(partition by user_id order by date rows between unbounded preceding and unbounded following)
end as min_purchase_date_in_group
from purchases
It applies the date 2023-04-01 until row 7, then it is null. I’m unsure how to make row 7 become 2023-07-09.
I’m also unsure how to determine the group number.
2
Answers
There is a SQL technique for "gaps and islands" [finding a range of missing values (gaps) or a range of consecutive values (islands) in a sequence of numbers or dates] which is relevant to your need.
First, rank the data using LAG function to get the previous date for each row, while partitioning by user_id
Then calculate a change indicator (isGroupChange) based on the 90-day gap between dates or the absence of a previous date.
Use the SUM window function to create a cumulative sum of the change indicator, which results in a grouping number for each set of consecutive rows until the next isGroupChange.
—
fiddle
This works for Postgres.
But not for Redshift, which is very different and does not support table functions.
(You tagged both.)
Start and end of each next group depend on the start of the previous group. This is very hard (and expensive) to solve with just set-based logic. I suggest a procedural approach – returning whole groups at a time to keep it snappy.
I replaced "row" with "id" and "group" with "grp", as those are reserved words in SQL and I’d rather not use them as identifiers.
Create this function once:
fiddle
Call:
Works for any number of distinct
user_id
.About ROW value comparison:
You absolutely need this index for performance (unless you already have it, maybe as PK or UNIQUE constraint?):
Related answers with more explanation: