
I have the following dataset:

create schema m;
create table m.parent_child_lvl_1(customer_id,date,order_type,order_id,sub_id)
as values
 (108384372, '2023-09-18'::date, 'sub_parent_first_order', 5068371361861, 407284605)
,(108384372, '2023-11-13', 'sub_order', 5134167539781, null)
,(108384372, '2024-01-08', 'sub_order', 5214687526981, null)
,(108384372, '2024-03-04', 'sub_order', 5283166126149, null)
,(108384372, '2024-06-18', 'sub_parent_order', 5421811138629, 500649255)
,(108384372, '2024-08-12', 'sub_order', 5508433641541, null)
,(108384372, '2024-08-12', 'sub_order', 5508433641541, null);

I need to fill each null Sub_ID with the most recent non-null Sub_ID, carrying it forward until the next non-null Sub_ID appears, then carrying that one forward, and so on. The fill order is determined by the Date column, and the fill needs to be grouped by Customer_ID. The full dataset has 15M rows, so any speed optimisation would be a bonus.

I’ve tried lead(), lag(), last_value(), and coalesce(), but the closest I got was only filling in the first null value after each non-null value.

The closest I’ve come is this:

SELECT
  first_value(sub_id) OVER (
    PARTITION BY customer_id, value_partition
    ORDER BY customer_id, date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS parent_sub_id,
  customer_id,
  order_id
FROM (
  SELECT
    customer_id,
    order_id,
    sub_id,
    date,
    SUM(CASE WHEN sub_id IS NULL THEN 0 ELSE 1 END) OVER (
      ORDER BY customer_id, date
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS value_partition
  FROM m.parent_child_lvl_1
  WHERE customer_id IN ('227330109', '90872199', '102694972')
    AND order_type IN (
      'sub_parent_first_order',
      'sub_parent_order',
      'sub_order'
    )
  ORDER BY date ASC
) AS q
ORDER BY customer_id, date;

Which seems to work for some records, but then goes wrong for others and I can’t figure out why.

2 Answers


  1. Your method isn’t wrong as such, but it may be handling edge cases in ways you didn’t expect:

    1. The window definition in your subquery is missing a partition by. The fact that customer_id comes first in its order by helps, because different customers are never in the same peer group.

      The problem is that this still lets higher customer_id’s see all preceding customers: a series of orders with a leading null would fall back on a sub_id from some completely unrelated customer with a lower identifier.

    2. Your outer query’s window frame is too narrow. If there are multiple orders from the same customer on the same date, some null and some not, only one of them sees enough rows to pick up a non-null value. You can fix it by expanding the frame to unbounded following.

    3. The outer query window only needs to be ordered by sub_id nulls last. You have already established sub-partitions with at most one non-null value in each; now you just need to make sure that value comes first so that first_value() picks it up.
      Redshift’s Unsupported PostgreSQL features page does list this:

      NULLS clause in Window functions

      I don’t think that means nulls last|first, because it clearly appears in Redshift’s window function syntax:

      The order_list is as follows.

      expression | column_name [ ASC | DESC ] 
      [ NULLS FIRST | NULLS LAST ]
      [, order_list ]
      

      So that entry is probably about nulls [not] distinct. And even if the clause were to get rejected, null values come last by default:

      By default, NULL values are sorted and ranked last in ASC ordering

      Unlike PostgreSQL, Redshift also offers an option to ignore nulls, which saves you the whole nested window function construction.

    demo at db<>fiddle

    select *, first_value(sub_id)
              over(partition by customer_id, value_partition 
                   order by sub_id nulls last
                   rows between unbounded preceding 
                            and unbounded following) as forward_filled_sub_id
    from(select customer_id
              , order_id
              , sub_id
              , date
              , count(sub_id)
                  over(partition by customer_id
                       order by date
                       rows between unbounded preceding 
                                and current row) as value_partition
         from m.parent_child_lvl_1 
         where customer_id in (  '108384372','222222222','333333333','444444444'
                               , '555555555','666666666','777777777','888888888' )
         and order_type in (  'sub_parent_first_order'
                            , 'sub_parent_order'
                            , 'sub_order')
        ) as subquery
    order by customer_id
           , date
           , order_id;
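To sanity-check the gaps-and-islands logic locally, here is the same query run through SQLite via Python’s stdlib sqlite3 module (an illustration, not Redshift: ISO date strings stand in for the date type, and `order by (sub_id is null)` is the portable spelling of nulls last):

```python
# Sanity check of the gaps-and-islands forward fill using SQLite
# (3.25+ supports these window functions) via Python's stdlib.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table parent_child_lvl_1(customer_id, date, order_type, order_id, sub_id);
insert into parent_child_lvl_1 values
 (108384372, '2023-09-18', 'sub_parent_first_order', 5068371361861, 407284605)
,(108384372, '2023-11-13', 'sub_order', 5134167539781, null)
,(108384372, '2024-01-08', 'sub_order', 5214687526981, null)
,(108384372, '2024-03-04', 'sub_order', 5283166126149, null)
,(108384372, '2024-06-18', 'sub_parent_order', 5421811138629, 500649255)
,(108384372, '2024-08-12', 'sub_order', 5508433641541, null)
,(108384372, '2024-08-12', 'sub_order', 5508433641541, null);
""")

rows = con.execute("""
select order_id,
       first_value(sub_id) over (
           partition by customer_id, value_partition
           order by (sub_id is null)      -- the non-null row sorts first
           rows between unbounded preceding
                    and unbounded following
       ) as forward_filled_sub_id
from (select customer_id, order_id, sub_id, date,
             count(sub_id) over (         -- island number: bumps on each non-null
                 partition by customer_id
                 order by date
                 rows between unbounded preceding and current row
             ) as value_partition
      from parent_child_lvl_1) as subquery
order by customer_id, date, order_id
""").fetchall()

for order_id, filled in rows:
    print(order_id, filled)
```

The first four rows come back with 407284605 and the last three with 500649255, matching the expected forward fill.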
    
    1. You can do this sort of thing:

      SUM(CASE WHEN sub_id IS NULL THEN 0 ELSE 1 END)
      

      a bit more efficiently using an aggregate filter clause:

      count(*) filter (where sub_id is not null)
      

      But for this particular use case, there’s a dedicated variant of count():

      count(sub_id)
      

      count ( "any" ) → bigint
      Computes the number of input rows in which the input value is not null.

      count(expression) is available in both PostgreSQL and Redshift, but unfortunately Redshift doesn’t seem to support the filter clause. It offers qualify, but that filters what comes out of the window function, not what is fed into it.
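A quick equivalence check of all three spellings, again on SQLite through Python’s stdlib (SQLite, unlike Redshift, supports the aggregate filter clause, in versions 3.30 and later):

```python
# All three spellings count the same non-null sub_id rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t(sub_id)")
con.executemany("insert into t values (?)", [(1,), (None,), (2,), (None,), (None,)])

a, b, c = con.execute("""
    select sum(case when sub_id is null then 0 else 1 end),
           count(*) filter (where sub_id is not null),
           count(sub_id)
    from t
""").fetchone()
print(a, b, c)  # 2 2 2
```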

    2. Whenever your window has an ORDER BY, a default frame of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implied in PostgreSQL:

      The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row’s last ORDER BY peer.

      Unfortunately, that’s not the case in Redshift:

      If an ORDER BY clause is used for an aggregate function, an explicit frame clause is required.

    3. Defining the window in its dedicated window section instead of directly in over() lets you re-use the same window definition multiple times and tidies up the select list a bit. You can also base other windows off pre-defined ones: demo at db<>fiddle

      select item_id
           , item_category
           , sum(weight)over w1 as "total weight in its category"
           , sum(volume)over w1 as "total volume in its category"
           , sum(weight)over(w1 order by date) as "stepping weight in its category" 
      from test
      window w1 as (partition by item_category);
      

      Unfortunately, Redshift doesn’t support it.

  2. A shorter alternative, using Redshift’s ignore nulls option (not available in PostgreSQL):

    IGNORE NULLS
    The function returns the last value in the frame that is not NULL (or NULL if all values are NULL).

    select *, last_value(sub_id) ignore nulls
              over(partition by customer_id
                   order by date
                   rows between unbounded preceding 
                            and current row) as forward_filled_sub_id
    from m.parent_child_lvl_1 
    where customer_id in (  '108384372','222222222','333333333','444444444'
                          , '555555555','666666666','777777777','888888888' )
      and order_type in (  'sub_parent_first_order'
                         , 'sub_parent_order'
                         , 'sub_order')
    order by customer_id
           , date
           , order_id;
    