skip to Main Content

Sample data below.

I want to clean up data based on the next non-null value of the same id, based on row (actually a timestamp).

  • I can’t do lag, because in some cases there are consecutive nulls.
  • I can’t do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (eg NYC instead of SF in col_a row 4). (I can do this, once I’ve accounted for everything else, for the cases where i have no next non-null value, like col_b row 9/10, to just fill in the last).

The only thing I can think of is to do

table_x as (select id, col_x from table where col_a is not null)

for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.

Appreciate any help!

row id col_a col_a_desired col_b col_b_desired
0 1 NYC red red
1 1 NYC NYC red red
2 1 SF SF blue
3 1 SF blue
4 1 SF SF blue blue
5 2 PAR PAR red red
6 2 LON LON blue
7 2 LON LON blue
8 2 LON blue blue
9 2 LON LON blue
10 2 LON blue

2

Answers


  1. I want to clean up data based on the next non-null value.

    So if you reverse the order, that’s the last non-null value.

    If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the script language of your choice (but that will be slower).

    The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.

    This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.

    Login or Signup to reply.
  2. Can you try this query?

    WITH samp AS (
      SELECT 0 row_id, 1 id, null col_a, 'red' col_b UNION ALL
      SELECT 1, 1, 'NYC', 'red' UNION ALL
      SELECT 2, 1, 'SF', NULL UNION ALL
      SELECT 3, 1, NULL, NULL UNION ALL
      SELECT 4, 1, 'SF', 'blue' UNION ALL
      SELECT 5, 2, 'PAR', 'red' UNION ALL
      SELECT 6, 2, 'LON', NULL UNION ALL
      SELECT 7, 2, 'LON', NULL UNION ALL
      SELECT 8, 2, NULL, 'blue' UNION ALL
      SELECT 9, 2, 'LON', NULL UNION ALL
      SELECT 10, 2, NULL, NULL
    )
      SELECT
      row_id,
      id,
      IFNULL(FIRST_VALUE(col_a IGNORE NULLS) 
        OVER (PARTITION BY id ORDER BY row_id
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
      FIRST_VALUE(col_a IGNORE NULLS) 
        OVER (PARTITION BY id ORDER BY row_id desc
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
      IFNULL(FIRST_VALUE(col_b IGNORE NULLS) 
        OVER (PARTITION BY id ORDER BY row_id
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
      FIRST_VALUE(col_b IGNORE NULLS) 
        OVER (PARTITION BY id ORDER BY row_id desc
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
     from samp order by id, row_id
    

    Output:
    enter image description here

    References:
    https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
    https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search