I have a table partitioned by date
-type column ds
, with many fields.
Not all columns change every day, therefore most rows are just duplicates of the previous rows.
I want to create an SCD2 table from the existing one (partitioned) and get
dt_start
– start of period when a given combination of values started repeatingdt_end
– end of that period
If the record the current combination, then dt_end is NULL
.
I thought about using a window function:
select ds as dt_start,
/*__*/(ds) over w1 as dt_end,
/*all_columns in the table*/
from public.app
window w1 as (partition by user_id, country_id order by ds)
group by /*all_columns in the table*/
Example input
CREATE TABLE public.app(
ds date NULL,
user_id int4 NULL,
country_id int2 NULL,
n_sessions_1d int2 NULL,
n_sessions_3d int2 NULL,
n_sessions_1w int2 NULL,
n_sessions_2w int2 NULL,
n_sessions_1m int2 NULL,
total_time_spent_1d int4 NULL,
total_time_spent_3d int4 NULL,
total_time_spent_1w int4 NULL,
total_time_spent_2w int4 NULL,
total_time_spent_1m int4 NULL,
is_subscription_1d int2 NULL,
is_subscription_3d int2 NULL
)
PARTITION BY RANGE (ds);
CREATE INDEX idx ON ONLY public.app USING btree (user_id, country_id);
CREATE TABLE public.app_202409 PARTITION OF public.app
FOR VALUES FROM ('2024-09-01') TO ('2024-09-30');
INSERT INTO public.app VALUES
('yesterday'::date,1,1,0,0,0,0,0,1,1,1,1,1,2,2)
,('today'::date, 1,1,0,0,0,0,0,1,1,1,1,1,2,2)
,('tomorrow'::date, 1,1,0,0,0,0,0,1,1,1,1,1,2,2)
--different user:
,('yesterday'::date,2,1,0,0,0,0,0,1,1,1,1,1,2,2)
,('today'::date, 2,1,1,0,0,0,0,1,1,1,1,1,2,2)--changed
,('tomorrow'::date, 2,1,1,0,0,0,0,1,1,1,1,1,2,2)
,('today'::date+2, 2,1,0,0,0,0,0,1,1,1,1,1,2,2)--changed back
,('today'::date+3, 2,1,0,0,0,0,0,1,1,1,1,1,2,2)
returning *;
Example output:
dt_start | dt_end | user_id | country_id | n_sessions_1d | n_sessions_3d | n_sessions_1w | n_sessions_2w | n_sessions_1m | total_time_spent_1d | total_time_spent_3d | total_time_spent_1w | total_time_spent_2w | total_time_spent_1m | is_subscription_1d | is_subscription_3d |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2024-09-16 | null | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
2024-09-16 | 2024-09-16 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
2024-09-17 | 2024-09-18 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
2024-09-19 | null | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
2
Answers
Your idea combining window functions with aggregation, could work:
between unbounded preceding and
unbounded following
to override the defaultbetween unbounded preceding and
current row
that comes with anorder by
and no frame definition.lag()
of all columns. Useis distinct from
to make itnull
-safe. Add a failover self-comparison for the first row which otherwise wouldn’t have anylag()
to compare to.count(*)
to establish sequences of duplicates.user_id
andcountry_id
, take itsmin(ds) as dt_start
andmax(ds) as dt_end
.not bool_or(is_latest)
orbool_and(not is_latest)
to see if the group contains the last record, and if so, swapdt_end
for anull
.demo at db<>fiddle
An alternative would be using correlated scalar subqueries. Each row looks ahead for a
max(ds)
, but no sooner than the earliesds
where the values differ for the sameuser_id
andcountry_id
:demo at db<>fiddle