I have the following dataset:
create schema m;
create table m.parent_child_lvl_1(customer_id,date,order_type,order_id,sub_id)
as values
(108384372,'18/09/2023'::date,'sub_parent_first_order',5068371361861,407284605)
,(108384372, '13/11/2023', 'sub_order', 5134167539781, null)
,(108384372, '8/01/2024', 'sub_order', 5214687526981, null)
,(108384372, '4/03/2024', 'sub_order', 5283166126149, null)
,(108384372, '18/06/2024', 'sub_parent_order', 5421811138629, 500649255)
,(108384372, '12/08/2024', 'sub_order', 5508433641541, null)
,(108384372, '12/08/2024', 'sub_order', 5508433641541, null);
I need to fill the yellow null’s with the Sub_ID
until it hits the next non-null Sub_ID
, then fill until the next, and so on.
The null order only works based on the Date
ordering.
Also, needs to be grouped by Customer_ID
. The full dataset has 15M rows, so any speed optimisation would be a bonus.
I’ve tried lead()
, lag()
, lastvalue()
, coalesce()
but the closest I got was only filling in the first null value after each real value.
The closest I’ve come to is this:
SELECT
first_value(sub_id) OVER (
PARTITION BY customer_id, value_partition
ORDER BY customer_id, date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS parent_sub_id,
customer_id,
order_id
FROM (
SELECT
customer_id,
order_id,
sub_id,
date,
SUM(CASE WHEN sub_id IS NULL THEN 0 ELSE 1 END) OVER (
ORDER BY customer_id, date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS value_partition
from m.parent_child_lvl_1
where customer_id in ('227330109','90872199','102694972')
and
order_type in (
'sub_parent_first_order',
'sub_parent_order',
'sub_order'
)
ORDER BY date ASC
) AS q
ORDER BY customer_id, date
;
Which seems to work for some records, but then goes wrong for others and I can’t figure out why.
2
Answers
Your method doesn’t seem wrong but it might be handling strange cases in a way you didn’t expect it to:
The window definition in your subquery is missing a
partition by
. The fact thatcustomer_id
is on the first position of itsorder by
helps because different customers are never in the same peer group.Thing is, it’s overlooking the fact that higher
customer_id
‘s seeall preceding customers. This means a series of orders with a
leading
null
would fall back onsub_id
from some completelyunrelated customer with a lower identifier.
Your outer query window frame spec is too narrow. If there are multiple orders from the same customer and date, some
null
and some not, only one of them will have the full picture to be able to pick one of the not-null
values. You can fix it by expanding it tounbounded following
.The outer query window only needs to be ordered by
sub_id nulls last
. You already established sub-partitions with at most a single not-null
in them, now you just need to make sure it comes first if you wantfirst_value()
to pick it up.Even though Redshift Unsupported PostgreSQL features lists this:
I don’t think they mean
nulls last|first
because it’s clearly in Redshift window function syntax:So that might be about
nulls [not] distinct
. Still, if the clause were to get rejected,null
values should come last by default:Unlike in PostgreSQL, there’s also an option to
ignore nulls
in Redshift, which saves you the whole nested window function call.demo at db<>fiddle
You can do this sort of thing:
a bit more efficiently using an aggregate
filter
clause:But for this particular use case, there’s a dedicated variant of
count()
:The
count(expression)
is available in both PostgreSQL and Redshift but unfortunately Redshift doesn’t seem to support thefilter
clause. It offers aqualify
, but that filters things going out of the function, not what’s fed into it.Whenever your window has an
ORDER BY
, theROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
is implied in PostgreSQL:Unfortunately, that’s not the case in Redshift:
Defining the window in its dedicated section instead of right in
over()
lets you re-use the same window definition multiple times and tidies up theselect
section a bit. You can also base other windows off of pre-defined windows: demo at db<>fiddleUnfortunately, Redshift doesn’t support it.
A shorter alternative, using Redshift
ignore nulls
option (not available in PostgreSQL):