Postgresql - How to group users into 90 day groups

dng
October 20, 2023
281 views
0 votes
2 Answers

I am looking to group users into separate groups based on when they bought something.

A new group starts after it’s been more than 90 days when the user last bought an item.

Example of what I am looking for:

row	user_id	date	min_purchase_date_in_group	group
1	123	2023-04-01	2023-04-01	1
2	123	2023-04-01	2023-04-01	1
3	123	2023-04-23	2023-04-01	1
4	123	2023-05-07	2023-04-01	1
5	123	2023-06-04	2023-04-01	1
6	123	2023-06-29	2023-04-01	1
7	123	2023-07-09	2023-07-09	2
8	123	2023-07-16	2023-07-09	2
9	123	2023-07-16	2023-07-09	2
10	123	2023-08-25	2023-07-09	2
11	123	2023-09-04	2023-07-09	2
12	123	2023-10-11	2023-10-11	3
13	123	2023-10-16	2023-10-11	3
14	123	2023-12-16	2023-10-11	3

Group 1 all share the same min_purchase_date_in_group since all the purchase dates are within 90 days of the first purchase.

Row 7 is the start of a new group because it is more than 90 days from the first purchase date (2023-04-01) and all the dates within group 2 are 90 days within the first purchase date (2023-07-09) for the group.

Row 12 starts a new group because the first purchase date (2023-10-11) for this group is more than 90 days from the first visit for group 2 (2023-07-09).

I was able to perform this in Python, but I am looking to solve this using SQL.

I am somewhat able to solve this if the dates are further apart by checking the previous row value and determining if it’s been more than 90 days.

If the dates are closer, like my example, my query assigns the same group to all 14 rows.

Here is what I currently have in order to solve for min_purchase_date_in_group:

select 
  user_id,
  date,
  case
    when prev_date is null then date
    when prev_date is not null and datediff(day, first_value(date) over(partition by user_id order by date rows between unbounded preceding and unbounded following), date) < 90 then first_value(date) over(partition by user_id order by date rows between unbounded preceding and unbounded following)
  end as min_purchase_date_in_group
from purchases

It applies the date 2023-04-01 until row 7, then it is null. I’m unsure how to make row 7 become 2023-07-09.

I’m also unsure how to determine the group number.

Answers

There is a SQL technique for "gaps and islands" [finding a range of missing values (gaps) or a range of consecutive values (islands) in a sequence of numbers or dates] which is relevant to your need.

First, rank the data using LAG function to get the previous date for each row, while partitioning by user_id
Then calculate a change indicator (isGroupChange) based on the 90-day gap between dates or the absence of a previous date.
Use the SUM window function to create a cumulative sum of the change indicator, which results in a grouping number for each set of consecutive rows until the next isGroupChange.

—

WITH RankedPurchases
AS (
    SELECT *
        , LAG(min_purchase_date_in_group) 
              OVER (PARTITION BY user_id ORDER BY DATE) AS prev_min_purchase_date
    FROM purchases
    )
    , GroupChanges
AS (
    SELECT *
        , CASE 
            WHEN prev_min_purchase_date IS NULL OR DATE - prev_min_purchase_date > 90 THEN 1
            ELSE 0
            END AS isGroupChange
    FROM RankedPurchases
    )
    , GroupNumbering
AS (
    SELECT *
        , SUM(isGroupChange) OVER (
            PARTITION BY user_id ORDER BY DATE
            ) AS purchase_group
    FROM GroupChanges
    )
SELECT
      row
    , user_id
    , DATE
    , min_purchase_date_in_group
    , purchase_group
FROM GroupNumbering
ORDER BY DATE;

row	user_id	date	min_purchase_date_in_group	purchase_group
1	123	2023-04-01	2023-04-01	1
2	123	2023-04-01	2023-04-01	1
3	123	2023-04-23	2023-04-01	1
4	123	2023-05-07	2023-04-01	1
5	123	2023-06-04	2023-04-01	1
6	123	2023-06-29	2023-04-01	1
7	123	2023-07-09	2023-07-09	2
8	123	2023-07-16	2023-07-09	2
9	123	2023-07-16	2023-07-09	2
10	123	2023-08-25	2023-07-09	2
11	123	2023-09-04	2023-07-09	2
12	123	2023-10-11	2023-10-11	3
13	123	2023-10-16	2023-10-11	3
14	123	2023-12-16	2023-10-11	3

fiddle

- ErwinBrandstetter
- October 20, 2023 at 4:37 am
- 0 votes
0
This works for Postgres.
But not for Redshift, which is very different and does not support table functions.
(You tagged both.)

Start and end of each next group depend on the start of the previous group. This is very hard (and expensive) to solve with just set-based logic. I suggest a procedural approach – returning whole groups at a time to keep it snappy.

I replaced "row" with "id" and "group" with "grp", as those are reserved words in SQL and I’d rather not use them as identifiers.

Create this function once:
```
CREATE OR REPLACE FUNCTION f_purchase_groups()
  RETURNS TABLE (id int, user_id int, date date, min_purchase_date_in_group date, grp int)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _user_id int := -1;             -- init to smaller than all user_id (!)
   _grp_date date := '-infinity';  -- init to -infinity
   _grp int := 0;                  -- init to 0
BEGIN
   WHILE _user_id IS NOT NULL
   LOOP
      SELECT p.user_id,    p.date, _grp + 1
      INTO    _user_id, _grp_date, _grp
      FROM   purchases p
      WHERE  (p.user_id, p.date) > (_user_id, _grp_date + 90)  -- ROW value comparison!
      ORDER  BY p.user_id, p.date
      LIMIT  1;

      RETURN QUERY
      SELECT p.id, p.user_id, p.date, _grp_date, _grp
      FROM   purchases p
      WHERE  p.user_id = _user_id
      AND    p.date   >= _grp_date
      AND    p.date   <= _grp_date + 90
      ORDER  BY p.user_id, p.date;
   END LOOP;
END
$func$;
```
fiddle

Call:
```
SELECT * FROM f_purchase_groups();
```
Works for any number of distinct user_id.

About ROW value comparison:
- SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
You absolutely need this index for performance (unless you already have it, maybe as PK or UNIQUE constraint?):
```
CREATE INDEX purchases_user_id_date ON purchases (user_id, date);
```
Related answers with more explanation:
- Group by repeating attribute
- Sliding 1-hour periods aggregation query
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – How to group users into 90 day groups

Answers