skip to Main Content

I have my table like this:

event_date user_id
2021-10-01 1
2021-10-01 2
2021-10-02 1
2021-10-02 2
2021-10-02 3
2021-10-03 3
2021-10-03 4
2021-10-03 4
2021-10-04 5
2021-10-04 2
2021-10-06 3
2021-10-06 5

Sample data

WITH my_table (event_date, user_id) AS (
    values ('2021-10-01', 1),
        ('2021-10-01', 2),
        ('2021-10-02', 1),
        ('2021-10-02', 2),
        ('2021-10-02', 3),
        ('2021-10-03', 3),
        ('2021-10-03', 4),
        ('2021-10-03', 4),
        ('2021-10-04', 5),
        ('2021-10-04', 2),
        ('2021-10-06', 3),
        ('2021-10-06', 5)
) 
SELECT *
FROM my_table

And I want to calculate moving count distinct user_id for the last 3 days.
Desired table should look like this:

event_date count_distinct_user_id_last_3_days
2021-10-01 2
2021-10-02 3
2021-10-03 4
2021-10-04 5
2021-10-05 4
2021-10-06 4

Thanks for your help!

2

Answers


  1. You can get an approximate moving count distinct via the following query below. Unfortunately, DISTINCT is not currently supported in window functions, so you can’t use COUNT(DISTINCT user_id) OVER (...)

    WITH my_table (event_date, user_id) AS (
        values ('2021-10-01', 1),
            ('2021-10-01', 2),
            ('2021-10-02', 1),
            ('2021-10-02', 2),
            ('2021-10-02', 3),
            ('2021-10-03', 3),
            ('2021-10-03', 4),
            ('2021-10-03', 4),
            ('2021-10-04', 5),
            ('2021-10-04', 2),
            ('2021-10-06', 3),
            ('2021-10-06', 5)
    )
    SELECT 
        event_date, 
        approx_distinct(user_id) OVER (
            ORDER BY cast(event_date AS date) 
            RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND CURRENT ROW)
    FROM my_table
    
    Login or Signup to reply.
  2. Since you are missing some days in the data (and you want to show all dates) and Athena version of Presto does not support date ranges you can try approach inserting missing dates with window functions based on rows and leveraging array functions:

    WITH my_table (event_date, user_id) AS (
        values ('2021-10-01', 1),
            ('2021-10-01', 2),
            ('2021-10-02', 1),
            ('2021-10-02', 2),
            ('2021-10-02', 3),
            ('2021-10-03', 3),
            ('2021-10-03', 4),
            ('2021-10-03', 4),
            ('2021-10-04', 5),
            ('2021-10-04', 2),
            ('2021-10-06', 3),
            ('2021-10-06', 5)
    ),
    my_table_dates as (
        SELECT date(event_date) event_date, user_id
        from my_table
    ),
    grouped as(
        SELECT event_date, 
            array_distinct(array_agg(user_id)) users
        FROM my_table_dates
        group by event_date
    ),
    with_dates_ranges as (
        select *, 
            sequence(lag(event_date) over(order by event_date) + interval '1' day, event_date, interval '1' day) dates
        from grouped
    ),
    with_gaps_filled as(
        select d event_date,
            if(d = event_date, users, array[]) users
        from with_dates_ranges,
        unnest (coalesce(dates, array[event_date])) t(d) -- coalesce for the first day
    )
    
    -- final query
    select date(event_date) event_date,
        cardinality(array_distinct(
            flatten(array_agg(users) over(order by event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW))
        )) count_distinct_user_id_last_3_days
    from with_gaps_filled
    

    Output:

    event_date count_distinct_user_id_last_3_days
    2021-10-01 2
    2021-10-02 3
    2021-10-03 4
    2021-10-04 5
    2021-10-05 4
    2021-10-06 3
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search