Sample data below.
I want to clean up data based on the next non-null value of the same id, based on row (actually a timestamp).
- I can’t do lag, because in some cases there are consecutive nulls.
- I can’t do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (eg NYC instead of SF in col_a row 4). (I can do this, once I’ve accounted for everything else, for the cases where i have no next non-null value, like col_b row 9/10, to just fill in the last).
The only thing I can think of is to do
table_x as (select id, col_x from table where col_a is not null)
for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.
Appreciate any help!
row | id | col_a | col_a_desired | col_b | col_b_desired |
---|---|---|---|---|---|
0 | 1 | – | NYC | red | red |
1 | 1 | NYC | NYC | red | red |
2 | 1 | SF | SF | – | blue |
3 | 1 | – | SF | – | blue |
4 | 1 | SF | SF | blue | blue |
5 | 2 | PAR | PAR | red | red |
6 | 2 | LON | LON | – | blue |
7 | 2 | LON | LON | – | blue |
8 | 2 | – | LON | blue | blue |
9 | 2 | LON | LON | – | blue |
10 | 2 | – | LON | – | blue |
2
Answers
So if you reverse the order, that’s the last non-null value.
If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the script language of your choice (but that will be slower).
The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.
This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.
Can you try this query?
Output:
References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls