I’ve been stumped trying to optimize this query and was hoping some of you database wizards might have some insight. Here is the setup.
Using TimescaleDB as my database, I have a wide table containing sensor data that looks like this:
time | sensor_id | wind_speed | wind_direction
---|---|---|---
'2023-12-18 12:15:00' | '1' | NULL | 176
'2023-12-18 12:13:00' | '1' | 4 | 177
'2023-12-18 12:11:00' | '1' | 3 | NULL
'2023-12-18 12:09:00' | '1' | 8 | 179
I want to write a query which gives me the most recent non-null value for a set of columns, filtered on `sensor_id`. For the above data (filtering on `sensor_id` 1), this query should return:
wind_speed | wind_direction
---|---
4 | 176
With that being said, my query looks like the below (when querying for `sensor_id`s in batches of 10):
```sql
SELECT
  (SELECT wind_speed     FROM sensor_data WHERE sensor_id = '1'  AND "time" > now() - '7 days'::interval AND wind_speed     IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_speed,
  (SELECT wind_direction FROM sensor_data WHERE sensor_id = '1'  AND "time" > now() - '7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_direction,
  (SELECT wind_speed     FROM sensor_data WHERE sensor_id = '2'  AND "time" > now() - '7 days'::interval AND wind_speed     IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_speed_two,
  (SELECT wind_direction FROM sensor_data WHERE sensor_id = '2'  AND "time" > now() - '7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_direction_two,
  -- ...
  (SELECT wind_speed     FROM sensor_data WHERE sensor_id = '10' AND "time" > now() - '7 days'::interval AND wind_speed     IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_speed_ten,
  (SELECT wind_direction FROM sensor_data WHERE sensor_id = '10' AND "time" > now() - '7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_direction_ten;
```
The table I am querying against has 1,000 unique `sensor_id`s, all of which report data at a 2-minute interval. Hence, we are talking hundreds of millions of rows.
I’ve created an index on `(sensor_id, time DESC)` to further optimize the query. With the index, this query takes roughly 400 ms of planning time and 50 ms of execution time.
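That is, something along these lines (the index name is illustrative):

```sql
CREATE INDEX sensor_data_sensor_id_time_idx
ON sensor_data (sensor_id, "time" DESC);
```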
How can I write the query differently (or add indexes) to achieve optimal planning and execution time?
2 Answers
Unfortunately, Postgres does not (yet, as of pg 16) implement `IGNORE NULLS` for window functions. That would allow a simple call of `first_value()` for each value column.

Solutions
There are various shorter and possibly (much) faster options.
You should at least have a (partial) index on `(ts)`, possibly on `(sensor_id, ts)`, or more. See below. It all depends on undisclosed details.

I find the name "time" for a `timestamp` column misleading. Using "ts" instead.

first_value() + DISTINCT ON

A shorter drop-in replacement.
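A sketch of the idea (sorting each window with `<column> IS NULL` first pushes NULLs to the end, so `first_value()` returns the most recent non-null value; the `IN` list stands in for one batch of sensors):

```sql
SELECT DISTINCT ON (sensor_id)
       sensor_id
     , first_value(wind_speed)     OVER (PARTITION BY sensor_id
                                         ORDER BY wind_speed IS NULL, ts DESC) AS wind_speed
     , first_value(wind_direction) OVER (PARTITION BY sensor_id
                                         ORDER BY wind_direction IS NULL, ts DESC) AS wind_direction
FROM   sensor_data
WHERE  sensor_id IN ('1', '2', '10')       -- one batch of sensors
AND    ts > now() - interval '7 days';
```

Since the window values are constant within each partition, `DISTINCT ON (sensor_id)` can pick any row per sensor.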
count() window function in subquery + filtered aggregate in main query
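A sketch of that idea: `count()` skips NULLs, so a running count ordered by `ts DESC` stays at 1 from the most recent non-null row of a column until the next older non-null row; the filtered aggregate in the main query then picks exactly that value:

```sql
SELECT sensor_id
     , min(wind_speed)     FILTER (WHERE grp_ws = 1) AS wind_speed
     , min(wind_direction) FILTER (WHERE grp_wd = 1) AS wind_direction
FROM  (
   SELECT sensor_id, wind_speed, wind_direction
        , count(wind_speed)     OVER (PARTITION BY sensor_id ORDER BY ts DESC) AS grp_ws
        , count(wind_direction) OVER (PARTITION BY sensor_id ORDER BY ts DESC) AS grp_wd
   FROM   sensor_data
   WHERE  sensor_id IN ('1', '2', '10')    -- one batch of sensors
   AND    ts > now() - interval '7 days'
   ) sub
GROUP  BY sensor_id;
```

`min()` ignores the NULL rows that share group 1, so only the most recent non-null value survives.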
Simpler based on "sensor" table
If you also have a table "sensor" with one row per relevant `sensor_id` (like you probably should), it gets simpler:
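A sketch with `LEFT JOIN LATERAL`, assuming the `sensor` table's key column is `sensor_id`:

```sql
SELECT s.sensor_id, ws.wind_speed, wd.wind_direction
FROM   sensor s
LEFT   JOIN LATERAL (
   SELECT wind_speed
   FROM   sensor_data
   WHERE  sensor_id = s.sensor_id
   AND    ts > now() - interval '7 days'
   AND    wind_speed IS NOT NULL
   ORDER  BY ts DESC
   LIMIT  1
   ) ws ON true
LEFT   JOIN LATERAL (
   SELECT wind_direction
   FROM   sensor_data
   WHERE  sensor_id = s.sensor_id
   AND    ts > now() - interval '7 days'
   AND    wind_direction IS NOT NULL
   ORDER  BY ts DESC
   LIMIT  1
   ) wd ON true
WHERE  s.sensor_id IN ('1', '2', '10');    -- one batch of sensors
```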
The last query (like your verbose original) can use customized indexes. Ideally, partial indexes: there are many rows per sensor, few value columns, many NULL values, and many outdated rows.
Use a constant that’s one week in the past at creation time. The index grows in size over time, but stays applicable. Recreate indexes with a later cut-off from time to time to keep the size at bay. (Not sure whether the timestamp bound pays off for your hypertables, though. Plain indexes may be good enough. I had plain Postgres in mind.)
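A sketch of such partial indexes, one per value column (index names are illustrative; the cut-off must be a constant because functions in index predicates have to be immutable, which rules out `now()`):

```sql
CREATE INDEX sensor_data_ws_part_idx ON sensor_data (sensor_id, ts DESC)
INCLUDE (wind_speed)                            -- allows index-only scans
WHERE  wind_speed IS NOT NULL
AND    ts > '2023-12-11 00:00'::timestamptz;    -- constant cut-off, ~1 week back at creation time

CREATE INDEX sensor_data_wd_part_idx ON sensor_data (sensor_id, ts DESC)
INCLUDE (wind_direction)
WHERE  wind_direction IS NOT NULL
AND    ts > '2023-12-11 00:00'::timestamptz;
```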
Then run the same query, but with a constant timestamp:
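For example (a sketch; the constant must not pre-date the index cut-off, or the planner cannot use the partial index):

```sql
SELECT s.sensor_id, ws.wind_speed
FROM   sensor s
LEFT   JOIN LATERAL (
   SELECT wind_speed
   FROM   sensor_data
   WHERE  sensor_id = s.sensor_id
   AND    ts > '2023-12-18 00:00'::timestamptz  -- constant timestamp instead of now() - interval
   AND    wind_speed IS NOT NULL
   ORDER  BY ts DESC
   LIMIT  1
   ) ws ON true
WHERE  s.sensor_id IN ('1', '2', '10');
-- repeat the LATERAL join for each additional value column
```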
Sorted subquery + first() aggregate function

If index support is not an option or not efficient, the most convenient query uses the aggregate function `first()`. It is probably fastest, too, if you use the C version from the additional module `first_last_agg`.

Required once per DB:
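A sketch, assuming the `first_last_agg` module is available on the server:

```sql
CREATE EXTENSION first_last_agg;
```

The query then sorts in a subquery and lets `first()` pick the first non-null value per column:

```sql
SELECT sensor_id
     , first(wind_speed)     AS wind_speed
     , first(wind_direction) AS wind_direction
FROM  (
   SELECT sensor_id, wind_speed, wind_direction
   FROM   sensor_data
   WHERE  sensor_id IN ('1', '2', '10')    -- one batch of sensors
   AND    ts > now() - interval '7 days'
   ORDER  BY sensor_id, ts DESC            -- sorted subquery: newest first
   ) sub
GROUP  BY sensor_id;
```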
Expanding on the solution you chose from the amazing list given by @ErwinBrandstetter:
Because you’re using TimescaleDB, you don’t actually need the `first_last_agg` extension, because you already have a (slightly different) `first` aggregate. That query can actually be simplified down to:
Based on your feedback that your original planning time was 400 ms, I do wonder how many chunks your Timescale hypertable has. I think you could probably optimize here!
Another avenue for optimization is to compress this data. In a test, compression dropped the storage required for my data by 8x and sped up my query (the one above) by 3x.
I compressed, segmenting by `sensor_id` and ordering by `time DESC, wind_speed, wind_direction`, roughly as shown below.
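A sketch of those compression settings (the seven-day compression policy interval is an assumption):

```sql
ALTER TABLE sensor_data SET (
   timescaledb.compress,
   timescaledb.compress_segmentby = 'sensor_id',
   timescaledb.compress_orderby   = 'time DESC, wind_speed, wind_direction'
);

-- automatically compress chunks older than the given interval (assumed here)
SELECT add_compression_policy('sensor_data', INTERVAL '7 days');
```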