skip to Main Content

I’m trying fill NULL values in multiple columns (different column types INT, VARCHAR) with previous NOT NULL value in a group ordered by date. Considering following table:

CREATE TABLE IF NOT EXISTS test (
   id VARCHAR,
   date DATE,
   value_1 INT,
   value_2 VARCHAR
);


INSERT INTO test VALUES 
(1, '2022-01-04', 5, 'asdf'),
(1, '2022-01-03', NULL, NULL),
(1, '2022-01-02', NULL, 'def'),
(1, '2022-01-01', 4, NULL),
(2, '2022-01-04', 1, 'a'),
(2, '2022-01-03', NULL, NULL),
(2, '2022-01-02', 2, 'b'),
(2, '2022-01-01', NULL, NULL);

enter image description here

I want to get here (please consider value_1 –> INTEGER, value_2 string):

enter image description here

4

Answers


  1. One day, PostgreSQL may support the IGNORE NULLS option for LEAD and LAG functions.

    In the mean time, you must use window functions to build groups, then select the maximum in each group.

    SELECT id, date,
           MAX(value_1) OVER (PARTITION BY id, grp_1) AS value_1,
           MAX(value_2) OVER (PARTITION BY id, grp_2) AS value_2
    FROM(
        SELECT *,
              COUNT(value_1) OVER (PARTITION BY id ORDER BY Date DESC) as grp_1,
              COUNT(value_2) OVER (PARTITION BY id ORDER BY Date DESC) as grp_2
        FROM test
    ) T
    ORDER BY ID, date
    

    Explanation:

    COUNT(value_1) OVER (PARTITION BY id ORDER BY Date ASC/DESC) is a cumulative count that only increases when value_1 is not null. The result is, for each consecutive value it takes (1, 2, 3), there will be 1 record where value_1 is not null, and 0, 1 or several records where the field is null.

    Example:

    1. First record encountered (following the ORDER BY Date specified in the window);
      value_1 is null => count is 0.
    2. Next record, value_1 is not null => count is 1.
    3. Next record, value_1 is null => count stays at 1.
    4. Next record, value_1 is null => count stays at 1.
    5. Next record, value_1 is not null => count is 2.
      … and so on.

    Using this pattern, the MAX(value_1) OVER (PARTITION BY id, grp_1) works on this partition of record by taking the non-null value out of each group.

    1. Group [count = 0] has only 1 record with null => The max is null.
    2. Group [count = 1] has 1 non null and 2 null values => The max is the non null value.
    3. Group [count = 2] has 1 non null value (and an unspecified number of null values) => The max is the non null value.

    The pattern keeps going; for every value of count (except 0), the max window function always has exactly 1 non null value to pick.


    Edit:
    The above query is the answer to the original version of the question. Now that dates have been changed to be in the exact opposite order as before, the 3 occurrences of ORDER BY date must all be inverted to match. Basically, 2 negatives make a positive.

    SELECT id, date,
           MAX(value_1) OVER (PARTITION BY id, grp_1) AS value_1,
           MAX(value_2) OVER (PARTITION BY id, grp_2) AS value_2
    FROM(
        SELECT *,
              COUNT(value_1) OVER (PARTITION BY id ORDER BY Date ASC) as grp_1,
              COUNT(value_2) OVER (PARTITION BY id ORDER BY Date ASC) as grp_2
        FROM test
    ) T
    ORDER BY ID, date DESC
    
    Login or Signup to reply.
  2. PostgreSQL distinct on can help whenever your spec says you want one, top record in a group. You can use it in a scalar subquery in an update: online demo

    update test t1 
    set value_1=(select distinct on (t2.id) value_1--The top value_1 (latest by date)
                 from test t2                      --that's in the same table   ^
                 where t2.date < t1.date           --in an earlier row          |
                 and   t2.id   = t1.id             --for the same id            |
                 and   t2.value_1 is not null      --and isn't a null.          |
                 order by t2.id, date desc)        --The top meaning the latest_/
    where t1.value_1 is null;
    

    Since all it does is go to a row with a null and from its context, point back at a row you want to borrow the value from, it’s fairly intuitive.

    Login or Signup to reply.
  3. Using windowed function:

    • ARRAY_AGG building array with defined order
    • ARRAY_REMOVE removing NULLs
    • getting the last item in the array (cte/or replicating entire query) – unfortunately no way to access it as (array)[-1]

    Query version readable:

    WITH cte AS (
      SELECT *,
        ARRAY_REMOVE((ARRAY_AGG(value_1) OVER(PARTITION BY id ORDER BY date)), NULL) 
           AS value_1_arr,
        ARRAY_REMOVE((ARRAY_AGG(value_2) OVER(PARTITION BY id ORDER BY date)), NULL) 
           AS value_2_arr
      FROM test
    )
    SELECT id, date,
        value_1_arr[array_upper(value_1_arr, 1)] AS value_1,
        value_2_arr[array_upper(value_2_arr, 1)] AS value_2
    FROM cte
    ORDER BY id, date DESC;
    

    Debug (array_is_build, it is a matter of taking last item)

    WITH cte AS (
      SELECT *,
        ARRAY_REMOVE((ARRAY_AGG(value_1) OVER(PARTITION BY id ORDER BY date)), NULL) 
           AS value_1_arr,
        ARRAY_REMOVE((ARRAY_AGG(value_2) OVER(PARTITION BY id ORDER BY date)), NULL) 
           AS value_2_arr
      FROM test
    )
    SELECT *
    FROM cte
    ORDER BY id, date DESC;
    

    enter image description here


    Single query (lack of [-1] and duplicating the code makes it "ugly"

    SELECT id, date,   
      (ARRAY_REMOVE((ARRAY_AGG(value_1) OVER(PARTITION BY id ORDER BY date)), NULL))
      [array_upper(ARRAY_REMOVE((ARRAY_AGG(value_1) OVER(PARTITION BY id ORDER BY date)), NULL),1)]
     AS value_1, 
      (ARRAY_REMOVE((ARRAY_AGG(value_2) OVER(PARTITION BY id ORDER BY date)), NULL))
      [array_upper(ARRAY_REMOVE((ARRAY_AGG(value_2) OVER(PARTITION BY id ORDER BY date)), NULL),1)]
     AS value_2
    FROM test
    ORDER BY id, date DESC;
    

    Output:

    enter image description here

    db<>fiddle demo

    Login or Signup to reply.
  4. Short window-functions alternative: online demo

    SELECT 
      id
    , date
    ,(JSONB_AGG(value_1)FILTER(WHERE value_1 IS NOT NULL)OVER w->>-1)::int AS value_1
    , JSONB_AGG(value_2)FILTER(WHERE value_2 IS NOT NULL)OVER w->>-1    AS    value_2
    FROM test
    WINDOW w AS (PARTITION BY id ORDER BY date) 
    ORDER BY id, date DESC;
    

    The principle remains the same as in Lukasz’ and Atmo’s answers: it constructs a frame looking back at value_Ns from previous dates for the same ID, gets rid of NULL‘s then picks the latest non-NULL.

    It’s just shortened and cleaned up by having:

    1. The window defined once and re-used.

    2. An aggregate filter, to skip NULLs instead of collecting and having to remove them.

    3. A jsonb array. Unlike a regular array it allows for negative subscripting, eliminating the need for a subquery, a CTE or duplicating the code in order to call array_upper() and get the last element. That’s at the price of having to cast non-text types.

      Negative index is equal to positive on a reversed array, so the same can be done with a regular array as long as its constructed in reverse, by inverting the frame clause and order: demo

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search