skip to Main Content

I have a table partitioned by date-type column ds, with many fields.
Not all columns change every day, therefore most rows are just duplicates of the previous rows.

I want to create an SCD2 table from the existing one (partitioned) and get

  1. dt_start – start of period when a given combination of values started repeating
  2. dt_end – end of that period

If the record the current combination, then dt_end is NULL.

I thought about using a window function:

select ds as dt_start,
       /*__*/(ds) over w1 as dt_end,
       /*all_columns in the table*/
from public.app
window w1 as (partition by user_id, country_id order by ds)
group by /*all_columns in the table*/

Example input

CREATE TABLE public.app(
    ds date NULL,
    user_id int4 NULL,
    country_id int2 NULL,
    n_sessions_1d int2 NULL,
    n_sessions_3d int2 NULL,
    n_sessions_1w int2 NULL,
    n_sessions_2w int2 NULL,
    n_sessions_1m int2 NULL,
    total_time_spent_1d int4 NULL,
    total_time_spent_3d int4 NULL,
    total_time_spent_1w int4 NULL,
    total_time_spent_2w int4 NULL,
    total_time_spent_1m int4 NULL,
    is_subscription_1d int2 NULL,
    is_subscription_3d int2 NULL
)
PARTITION BY RANGE (ds);
CREATE INDEX idx ON ONLY public.app USING btree (user_id, country_id);
CREATE TABLE public.app_202409 PARTITION OF public.app
    FOR VALUES FROM ('2024-09-01') TO ('2024-09-30');

INSERT INTO public.app VALUES
 ('yesterday'::date,1,1,0,0,0,0,0,1,1,1,1,1,2,2)
,('today'::date,    1,1,0,0,0,0,0,1,1,1,1,1,2,2)
,('tomorrow'::date, 1,1,0,0,0,0,0,1,1,1,1,1,2,2)
--different user:
,('yesterday'::date,2,1,0,0,0,0,0,1,1,1,1,1,2,2)
,('today'::date,    2,1,1,0,0,0,0,1,1,1,1,1,2,2)--changed
,('tomorrow'::date, 2,1,1,0,0,0,0,1,1,1,1,1,2,2)
,('today'::date+2,  2,1,0,0,0,0,0,1,1,1,1,1,2,2)--changed back
,('today'::date+3,  2,1,0,0,0,0,0,1,1,1,1,1,2,2)
returning *;

Example output:

dt_start dt_end user_id country_id n_sessions_1d n_sessions_3d n_sessions_1w n_sessions_2w n_sessions_1m total_time_spent_1d total_time_spent_3d total_time_spent_1w total_time_spent_2w total_time_spent_1m is_subscription_1d is_subscription_3d
2024-09-16 null 1 1 0 0 0 0 0 1 1 1 1 1 2 2
2024-09-16 2024-09-16 2 1 0 0 0 0 0 1 1 1 1 1 2 2
2024-09-17 2024-09-18 2 1 1 0 0 0 0 1 1 1 1 1 2 2
2024-09-19 null 2 1 0 0 0 0 0 1 1 1 1 1 2 2

2

Answers


  1. Your idea combining window functions with aggregation, could work:

    1. Widen your frame to between unbounded preceding andunbounded following to override the default between unbounded preceding andcurrent row that comes with an order by and no frame definition.
    2. In each row, compare all columns to lag() of all columns. Use is distinct from to make it null-safe. Add a failover self-comparison for the first row which otherwise wouldn’t have any lag() to compare to.
    3. Wrap that in a subquery and use a stepping/tumbling count(*) to establish sequences of duplicates.
    4. For each sequence of duplicates for a given user_id and country_id, take its min(ds) as dt_start and max(ds) as dt_end.
    5. Use not bool_or(is_latest) or bool_and(not is_latest) to see if the group contains the last record, and if so, swap dt_end for a null.

    demo at db<>fiddle

    select min(ds) as dt_start 
         , case when every(not is_latest) then max(ds) end as dt_end
         , user_id, country_id, n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m, total_time_spent_1d, total_time_spent_3d, total_time_spent_1w, total_time_spent_2w, total_time_spent_1m, is_subscription_1d, is_subscription_3d
    from(select*, count(*)filter(where is_changing)over w2 as sequence_of_duplicates
         from(select*, (n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m, total_time_spent_1d, total_time_spent_3d, total_time_spent_1w, total_time_spent_2w, total_time_spent_1m, is_subscription_1d, is_subscription_3d)
                       is distinct from
                       lag((n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m, total_time_spent_1d, total_time_spent_3d, total_time_spent_1w, total_time_spent_2w, total_time_spent_1m, is_subscription_1d, is_subscription_3d)
                            ,1,(n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m, total_time_spent_1d, total_time_spent_3d, total_time_spent_1w, total_time_spent_2w, total_time_spent_1m, is_subscription_1d, is_subscription_3d))
                       over w1 as is_changing
                     , lead(ds)over w1 is null as is_latest
              from public.app
              window w1 as (partition by user_id, country_id
                            order by ds
                            rows between unbounded preceding 
                                     and unbounded following) )s1
         window w2 as (partition by user_id, country_id
                       order by ds
                       rows between unbounded preceding 
                                and current row) )s2
    group by sequence_of_duplicates, user_id, country_id, n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m, total_time_spent_1d, total_time_spent_3d, total_time_spent_1w, total_time_spent_2w, total_time_spent_1m, is_subscription_1d, is_subscription_3d
    order by 1, user_id, country_id;
    
    dt_start dt_end user_id country_id n_sessions_1d n_sessions_3d n_sessions_1w n_sessions_2w n_sessions_1m total_time_spent_1d total_time_spent_3d total_time_spent_1w total_time_spent_2w total_time_spent_1m is_subscription_1d is_subscription_3d
    2024-09-16 null 1 1 0 0 0 0 0 1 1 1 1 1 2 2
    2024-09-16 2024-09-16 2 1 0 0 0 0 0 1 1 1 1 1 2 2
    2024-09-17 2024-09-18 2 1 1 0 0 0 0 1 1 1 1 1 2 2
    2024-09-19 null 2 1 0 0 0 0 0 1 1 1 1 1 2 2
    Login or Signup to reply.
  2. An alternative would be using correlated scalar subqueries. Each row looks ahead for a max(ds), but no sooner than the earlies ds where the values differ for the same user_id and country_id:
    demo at db<>fiddle

    select distinct on(a1.user_id, a1.country_id, dt_end)
           ds as dt_start
         , (select max(ds)
            from public.app as a2
            where (a2.user_id, a2.country_id, a2.n_sessions_1d, a2.n_sessions_3d, a2.n_sessions_1w, a2.n_sessions_2w, a2.n_sessions_1m, a2.total_time_spent_1d, a2.total_time_spent_3d, a2.total_time_spent_1w, a2.total_time_spent_2w, a2.total_time_spent_1m, a2.is_subscription_1d, a2.is_subscription_3d)
                  is not distinct from
                  (a1.user_id, a1.country_id, a1.n_sessions_1d, a1.n_sessions_3d, a1.n_sessions_1w, a1.n_sessions_2w, a1.n_sessions_1m, a1.total_time_spent_1d, a1.total_time_spent_3d, a1.total_time_spent_1w, a1.total_time_spent_2w, a1.total_time_spent_1m, a1.is_subscription_1d, a1.is_subscription_3d)
            and a2.ds<(select min(ds) 
                       from public.app as a3
                       where (a3.n_sessions_1d, a3.n_sessions_3d, a3.n_sessions_1w, a3.n_sessions_2w, a3.n_sessions_1m, a3.total_time_spent_1d, a3.total_time_spent_3d, a3.total_time_spent_1w, a3.total_time_spent_2w, a3.total_time_spent_1m, a3.is_subscription_1d, a3.is_subscription_3d)
                             is distinct from
                             (a1.n_sessions_1d, a1.n_sessions_3d, a1.n_sessions_1w, a1.n_sessions_2w, a1.n_sessions_1m, a1.total_time_spent_1d, a1.total_time_spent_3d, a1.total_time_spent_1w, a1.total_time_spent_2w, a1.total_time_spent_1m, a1.is_subscription_1d, a1.is_subscription_3d)
                         and a3.user_id=a1.user_id
                         and a3.country_id=a1.country_id
                         and a3.ds>a1.ds )
            )as dt_end
        , a1.user_id, a1.country_id, a1.n_sessions_1d, a1.n_sessions_3d, a1.n_sessions_1w, a1.n_sessions_2w, a1.n_sessions_1m, a1.total_time_spent_1d, a1.total_time_spent_3d, a1.total_time_spent_1w, a1.total_time_spent_2w, a1.total_time_spent_1m, a1.is_subscription_1d, a1.is_subscription_3d
    from public.app as a1
    group by 1,2,a1.user_id, a1.country_id, a1.n_sessions_1d, a1.n_sessions_3d, a1.n_sessions_1w, a1.n_sessions_2w, a1.n_sessions_1m, a1.total_time_spent_1d, a1.total_time_spent_3d, a1.total_time_spent_1w, a1.total_time_spent_2w, a1.total_time_spent_1m, a1.is_subscription_1d, a1.is_subscription_3d
    order by a1.user_id, a1.country_id, dt_end, dt_start;
    
    dt_start dt_end user_id country_id n_sessions_1d n_sessions_3d n_sessions_1w n_sessions_2w n_sessions_1m total_time_spent_1d total_time_spent_3d total_time_spent_1w total_time_spent_2w total_time_spent_1m is_subscription_1d is_subscription_3d
    2024-09-16 null 1 1 0 0 0 0 0 1 1 1 1 1 2 2
    2024-09-16 2024-09-16 2 1 0 0 0 0 0 1 1 1 1 1 2 2
    2024-09-17 2024-09-18 2 1 1 0 0 0 0 1 1 1 1 1 2 2
    2024-09-19 null 2 1 0 0 0 0 0 1 1 1 1 1 2 2
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search