SQL: how to filter records by most recent occurrence within a cohort of people - Postgresql

crimson
April 27, 2023
321 views
3 votes
3 Answers

I’m working with the following tables in a Postgres database:

card_purchases

purchaser_id    purchase_date   card_id
44              12/10/2021      3
3               1/27/2022       1
19              1/31/2022       2
22              2/15/2022       1
4               6/9/2022        4
17              8/20/2022       2
19              2/4/2023        2
22              3/17/2023       1
3               3/19/2023       2
747             3/24/2023       2
1193            4/14/2023       1

card_templates

card_id   card_name
1         discount pass
2         exclusive pass
3         senior citizen discount
4         customer loyalty

discounts

discount_id     discount_name
1001            goodwill applied
1002            discount pass applied
1003            exclusive pass applied
1004            exclusive pass applied
1005            discount pass applied
1006            exclusive pass applied
1007            exclusive pass applied
1008            discount pass applied
1009            exclusive pass applied
1010            exclusive pass applied
1011            discount pass applied
1012            exclusive pass applied
1013            exclusive pass applied
1014            exclusive pass applied
1015            exclusive pass applied
1016            discount pass applied

discount_orders

order_id    discount_id         created_at  purchaser_id
1100            1002            1/28/2022   3
1101            1003            1/31/2022   19
1102            1004            2/4/2022    19
1103            1005            3/15/2022   22
1104            1006            3/17/2022   19
1105            1007            8/27/2022   17
1106            1008            8/30/2022   22
1107            1009            2/4/2023    19
1108            1010            2/19/2023   19
1109            1011            3/18/2023   22
1110            1012            3/19/2023   19
1111            1013            3/31/2023   747
1112            1014            4/5/2023    19
1113            1015            4/15/2023   747
1114            1016            4/20/2023   1193

Here is a SQL Fiddle with all of the data: http://sqlfiddle.com/#!17/e64c2/1

By way of some background, our customers buy "cards" (i.e. passes), which give them the right to purchase our products at a discounted price. The cards are typically valid for 1 year, at which point customers can purchase another card.

My goal is to see the number of purchases that each customer has made after they most recently purchased a card. The problem is that I don’t know how to compute the count if a customer has purchased more than one card.

For example, customer 19 purchased two cards: on 1/31/22 and then again on 2/4/23. This customer used the second card a total of 4 times.

SELECT
    dord.purchaser_id                   AS user_id
    , COUNT(DISTINCT(dord.order_id))    AS count_orders
FROM
    discounts d
JOIN
    discount_orders dord
ON d.discount_id = dord.discount_id
WHERE
    d.discount_name LIKE '%pass%'
    AND dord.purchaser_id IN (
            SELECT
                cp.purchaser_id
            FROM
                card_purchases cp
            JOIN
                card_templates ct
            ON cp.card_id = ct.card_id
            WHERE
                ct.card_id IN (1, 2)
        )
GROUP BY
    dord.purchaser_id

This code is outputting a count of 7 purchases for customer 19 (which is correct). However, I’m only interested in seeing purchased made on or after the purchase of the most recent card. So, for customer 19, this would be 4 purchases (because the 2nd card was purchased on 2/4/23 and there were 4 purchases made on or after this date).

How do we fix this code such that the count is only applied to orders placed on or after the customer’s most recent purchase of a card?

Answers

You can use row_number() window function to select only the last card purchase of the customers.

Query:

with recent_card_purchases as
(
   SELECT *,row_number()over(partition by purchaser_id order by purchase_date desc)rn FROM card_purchases cp
                        JOIN
                            card_templates ct
                        ON cp.card_id = ct.card_id
                        WHERE
                            ct.card_id IN (1, 2)
)
SELECT
    dord.purchaser_id                   AS user_id
    , COUNT(DISTINCT(dord.order_id))    AS count_orders

FROM
    discounts d

JOIN
    discount_orders dord
ON d.discount_id = dord.discount_id
join recent_card_purchases rcp 
  on rcp.purchaser_id=dord.purchaser_id and dord.created_at >=rcp.purchase_date    
WHERE
    d.discount_name LIKE '%pass%' and rcp.rn=1
GROUP BY
    dord.purchaser_id

Output
| user_id | count_orders |
|——–:|————-:|
| 17 | 1 |
| 19 | 4 |
| 22 | 1 |
| 747 | 2 |
| 1193 | 1 |

fiddle

The overall query logic looks good, we just need to adapt the subquery filtering. Instead of IN, we can use a correlated subquery that returns the date of the latest card purchase of the current purchaser:

select do.purchaser_id, count(*)
from discount_orders do
inner join discounts d on d.discount_id = do.discount_id
where d.discount_name like '%pass%'
  and do.created_at >= (
        select max(cp.purchase_date)
        from cards_purchase cp
        where cp.card_id in (1, 2)
          and cp.purchaser_id = do.purchaser_id -- correlation 
    )
group by do.purchaser_id

Note that, for the data that you have shown, there is no need to use distinct. Also the join on cards_template in the subquery seems unnecessary.

We can also express the logic with a join:

select do.purchaser_id, count(*)
from discount_orders do
inner join discounts d on d.discount_id = do.discount_id
inner join (
    select cp.purchaser_id, max(cp.purchase_date) purchase_date
    from cards_purchase cp
    where cp.card_id in (1, 2)
) cp on cp.purchaser_id = do.purchaser_id and cp.purchase_date <= do.created_at    
where d.discount_name like '%pass%'
group by do.purchaser_id

This is a perfect opportunity to apply a lateral join. Find the card purchases of interest and then query for the number of orders. In this instance a scalar subquery also works since there’s only a single value to return. The lateral join would allow for other columns to be added.

-- lateral join
select * from (
    select purchaser_id, max(purchase_date) as last_purchase_date
    from card_purchases
    where card_id in (1, 2)
    group by purchaser_id
) mcp inner join lateral (
    select count(1) as cnt from discount_orders as o
    where o.purchaser_id = mcp.purchaser_id and o.created_at >= mcp.last_purchase_date
) mdo on true;

Equivalent to:

-- scalar subquery
select *,
    (
    select count(1) as cnt from discount_orders as o
    where o.purchaser_id = mcp.purchaser_id and o.created_at >= mcp.last_purchase_date
    ) as cnt
from (
    select purchaser_id, max(purchase_date) as last_purchase_date
    from card_purchases
    where card_id in (1, 2)
    group by purchaser_id
) mcp;

I removed the filter on the discount name as I don’t believe it’s necessary here.

https://dbfiddle.uk/5QN_gb2s?hide=4

Please signup or login to give your own answer.

Click here to cancel reply.

SQL: how to filter records by most recent occurrence within a cohort of people – Postgresql

Answers