skip to Main Content

I am looking to group users into separate groups based on when they bought something.

A new group starts after it’s been more than 90 days when the user last bought an item.

Example of what I am looking for:

row user_id date min_purchase_date_in_group group
1 123 2023-04-01 2023-04-01 1
2 123 2023-04-01 2023-04-01 1
3 123 2023-04-23 2023-04-01 1
4 123 2023-05-07 2023-04-01 1
5 123 2023-06-04 2023-04-01 1
6 123 2023-06-29 2023-04-01 1
7 123 2023-07-09 2023-07-09 2
8 123 2023-07-16 2023-07-09 2
9 123 2023-07-16 2023-07-09 2
10 123 2023-08-25 2023-07-09 2
11 123 2023-09-04 2023-07-09 2
12 123 2023-10-11 2023-10-11 3
13 123 2023-10-16 2023-10-11 3
14 123 2023-12-16 2023-10-11 3

Group 1 all share the same min_purchase_date_in_group since all the purchase dates are within 90 days of the first purchase.

Row 7 is the start of a new group because it is more than 90 days from the first purchase date (2023-04-01) and all the dates within group 2 are 90 days within the first purchase date (2023-07-09) for the group.

Row 12 starts a new group because the first purchase date (2023-10-11) for this group is more than 90 days from the first visit for group 2 (2023-07-09).

I was able to perform this in Python, but I am looking to solve this using SQL.

I am somewhat able to solve this if the dates are further apart by checking the previous row value and determining if it’s been more than 90 days.

If the dates are closer, like my example, my query assigns the same group to all 14 rows.

Here is what I currently have in order to solve for min_purchase_date_in_group:

select 
  user_id,
  date,
  case
    when prev_date is null then date
    when prev_date is not null and datediff(day, first_value(date) over(partition by user_id order by date rows between unbounded preceding and unbounded following), date) < 90 then first_value(date) over(partition by user_id order by date rows between unbounded preceding and unbounded following)
  end as min_purchase_date_in_group
from purchases

It applies the date 2023-04-01 until row 7, then it is null. I’m unsure how to make row 7 become 2023-07-09.

I’m also unsure how to determine the group number.

2

Answers


  1. There is a SQL technique for "gaps and islands" [finding a range of missing values (gaps) or a range of consecutive values (islands) in a sequence of numbers or dates] which is relevant to your need.

    • First, rank the data using LAG function to get the previous date for each row, while partitioning by user_id

    • Then calculate a change indicator (isGroupChange) based on the 90-day gap between dates or the absence of a previous date.

    • Use the SUM window function to create a cumulative sum of the change indicator, which results in a grouping number for each set of consecutive rows until the next isGroupChange.

    WITH RankedPurchases
    AS (
        SELECT *
            , LAG(min_purchase_date_in_group) 
                  OVER (PARTITION BY user_id ORDER BY DATE) AS prev_min_purchase_date
        FROM purchases
        )
        , GroupChanges
    AS (
        SELECT *
            , CASE 
                WHEN prev_min_purchase_date IS NULL OR DATE - prev_min_purchase_date > 90 THEN 1
                ELSE 0
                END AS isGroupChange
        FROM RankedPurchases
        )
        , GroupNumbering
    AS (
        SELECT *
            , SUM(isGroupChange) OVER (
                PARTITION BY user_id ORDER BY DATE
                ) AS purchase_group
        FROM GroupChanges
        )
    SELECT
          row
        , user_id
        , DATE
        , min_purchase_date_in_group
        , purchase_group
    FROM GroupNumbering
    ORDER BY DATE;
    
    row user_id date min_purchase_date_in_group purchase_group
    1 123 2023-04-01 2023-04-01 1
    2 123 2023-04-01 2023-04-01 1
    3 123 2023-04-23 2023-04-01 1
    4 123 2023-05-07 2023-04-01 1
    5 123 2023-06-04 2023-04-01 1
    6 123 2023-06-29 2023-04-01 1
    7 123 2023-07-09 2023-07-09 2
    8 123 2023-07-16 2023-07-09 2
    9 123 2023-07-16 2023-07-09 2
    10 123 2023-08-25 2023-07-09 2
    11 123 2023-09-04 2023-07-09 2
    12 123 2023-10-11 2023-10-11 3
    13 123 2023-10-16 2023-10-11 3
    14 123 2023-12-16 2023-10-11 3

    fiddle

    Login or Signup to reply.
  2. This works for Postgres.
    But not for Redshift, which is very different and does not support table functions.
    (You tagged both.)

    Start and end of each next group depend on the start of the previous group. This is very hard (and expensive) to solve with just set-based logic. I suggest a procedural approach – returning whole groups at a time to keep it snappy.

    I replaced "row" with "id" and "group" with "grp", as those are reserved words in SQL and I’d rather not use them as identifiers.

    Create this function once:

    CREATE OR REPLACE FUNCTION f_purchase_groups()
      RETURNS TABLE (id int, user_id int, date date, min_purchase_date_in_group date, grp int)
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       _user_id int := -1;             -- init to smaller than all user_id (!)
       _grp_date date := '-infinity';  -- init to -infinity
       _grp int := 0;                  -- init to 0
    BEGIN
       WHILE _user_id IS NOT NULL
       LOOP
          SELECT p.user_id,    p.date, _grp + 1
          INTO    _user_id, _grp_date, _grp
          FROM   purchases p
          WHERE  (p.user_id, p.date) > (_user_id, _grp_date + 90)  -- ROW value comparison!
          ORDER  BY p.user_id, p.date
          LIMIT  1;
    
          RETURN QUERY
          SELECT p.id, p.user_id, p.date, _grp_date, _grp
          FROM   purchases p
          WHERE  p.user_id = _user_id
          AND    p.date   >= _grp_date
          AND    p.date   <= _grp_date + 90
          ORDER  BY p.user_id, p.date;
       END LOOP;
    END
    $func$;
    

    fiddle

    Call:

    SELECT * FROM f_purchase_groups();
    

    Works for any number of distinct user_id.

    About ROW value comparison:

    You absolutely need this index for performance (unless you already have it, maybe as PK or UNIQUE constraint?):

    CREATE INDEX purchases_user_id_date ON purchases (user_id, date);
    

    Related answers with more explanation:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search