skip to Main Content

I have a subscription table with 4 fields: id, customer_id, start_date and end_date.
It lists all subscriptions of my customers. No subscription has an empty end_date, and a customer can have several subscriptions at once.
For example, a customer id 37 can have the following subscriptions:

id  customer_id start_at    end_at
44  37  2019-03-21  2019-03-21
17819   37  2020-03-23  2020-03-23
22302   37  2020-04-24  2021-07-25
42213   37  2021-04-25  2023-04-26
92013   37  2023-04-26  2024-04-26

These records mean that customer 37 was a subscriber on 2019-03-21, then on 2020-03-23, then from 2020-04-24 to 2024-04-26, for a total of 1463 days.

I am trying to write a query to get the number of days each customer has been a subscriber in a given period.
Customer 37 has been a subscriber 365 days in 2023.
Subscriptions can overlap, as a subscriber can have multiple subscriptions at once.

The result of the query should be something like:

customer_id total_subscription_days
37  1463
38  526
39  426
40  365
41  325

My database is running on MySQL 8.2.12.

I tried using lag, lead, CTEs, least and greatest, to no avail. I tried chatgpt and stackoverflow.

EDIT: here below is what I tried so far:

1st try:

SELECT 
    customer_id,
    SUM(DATEDIFF(
        LEAST(end_at, '2023-12-31'), 
        GREATEST(start_at, '2023-01-01')
    ) + 1) AS total_subscription_days
FROM (
    SELECT 
        customer_id,
        start_at,
        end_at
    FROM 
        subscription
    WHERE 
        start_at <= '2023-12-31' AND end_at >= '2023-01-01'
    UNION ALL
    SELECT 
        s1.customer_id,
        LEAD(s1.end_at) OVER (PARTITION BY s1.customer_id ORDER BY s1.end_at),
        '2023-12-31'
    FROM 
        subscription s1
    LEFT JOIN 
        subscription s2 ON s1.customer_id = s2.customer_id 
                       AND s1.end_at < s2.start_at
    WHERE 
        s2.start_at IS NOT NULL
) AS merged_subscriptions
GROUP BY 
    customer_id;

Even though I wanted to know subscription days in 2023, I had results bigger than 365 days. So it seems to count duplicates because of the join.

2nd try:

WITH subscription_periods AS (
    SELECT 
        customer_id,
        start_at,
        end_at,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY start_at) AS period_number
    FROM 
        subscription
    WHERE 
        start_at <= '2023-12-31' AND end_at >= '2023-01-01' AND customer_id < 100
),
subscription_days AS (
    SELECT 
        customer_id,
        SUM(
            DATEDIFF(
                LEAST(end_at, '2023-12-31'), 
                GREATEST(start_at, '2023-01-01')
            ) + 1
        ) AS days
    FROM 
        (
            SELECT 
                customer_id,
                start_at,
                LEAD(end_at) OVER (PARTITION BY customer_id ORDER BY start_at) AS end_at
            FROM 
                subscription_periods
        ) AS overlapping_periods
    WHERE 
        end_at >= '2023-01-01'
    GROUP BY 
        customer_id
)
SELECT 
    customer_id,
    SUM(days) AS total_subscription_days
FROM 
    subscription_days
GROUP BY 
    customer_id;

I restricted to the first 100 customers, otherwise I was getting a 504 error.
This query seems to not be able to take into accounts gaps between subscription.
For a customer having been subscribed from 2023-01-01 to 2023-04-01 and then from 2023-05-01 to 2023-08-01, it seems to be counting days between 2023-01-01 and 2023-08-01.

2

Answers


  1. It is kind of a merge overlapping intervals problem:

    • Find all rows that intersect the given range e.g. 2023-01-01 to 2023-12-31 (cte1 where clause)
    • Clamp the start and end dates if they’re outside this range (cte1 select clause)
    • Group rows as follows:
      If there is gap between end date of one row and start date of next then don’t combine otherwise combine (cte2 and cte3)
    • Once combined, calculate the difference between the min and max date of each group (outermost query)
    set @date1 = '2023-01-01';
    set @date2 = '2023-12-31';
    
    with cte1 as (
        select
            customer_id,
            greatest(start_at, @date1) as period_date1, 
            least(end_at, @date2) as period_date2
        from t
        where start_at <= @date2 and end_at >= @date1
    ), cte2 as (
        select *, case when period_date1 <= lag(period_date2) over (partition by customer_id order by period_date1) then 0 else 1 end as newgrp
        from cte1
    ), cte3 as (
        select *, sum(newgrp) over (partition by customer_id order by period_date1) as grpnum
        from cte2
    )
    select customer_id, min(period_date1) as group_date1, max(period_date2) as group_date2, datediff(max(period_date2), min(period_date1)) + 1 as diff
    from cte3
    group by customer_id, grpnum
    

    Demo on DB<>Fiddle

    Login or Signup to reply.
  2. One option is identifying overlapping date ranges with the arithmetic condition t2.start_at > t1.start_at and t2.start_at < t1.end_at. then, calculate the full extent of these overlaps using LEAST and GREATEST functions, and then exclude any ranges within these overlaps that have already been calculated :

    with cte as (
      select t1.id, 
           t1.customer_id, 
           coalesce(least(t1.start_at, t2.start_at), t1.start_at) as start_at,
           coalesce(greatest(t1.end_at, t2.end_at), t1.end_at) as end_at,
           t2.id as to_del
      from mytable t1
      left join mytable t2 on t1.customer_id = t2.customer_id
                            and t2.start_at > t1.start_at
                            and t2.start_at < t1.end_at
    )
    select customer_id, sum(DATEDIFF(end_at, start_at)) as total_subscription_days
    from cte
    where id not in (select to_del from cte where to_del is not null)
    group by customer_id;
    

    Results :

    customer_id total_subscription_days
    37          1463
    

    Demo here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search