I’ve been stumped trying to optimize this query and was hoping some of you database wizards might have some insight. Here is the setup.
Using TimescaleDB as my database, I have a wide table containing sensor data that looks like this:
time | sensor_id | wind_speed | wind_direction
---|---|---|---
'2023-12-18 12:15:00' | '1' | NULL | 176
'2023-12-18 12:13:00' | '1' | 4 | 177
'2023-12-18 12:11:00' | '1' | 3 | NULL
'2023-12-18 12:09:00' | '1' | 8 | 179
I want to write a query which gives me the most recent non-null value for a set of columns, filtered on `sensor_id`. For the above data (filtering on `sensor_id` 1), this query should return:
wind_speed | wind_direction
---|---
4 | 176
With that being said, my query looks like the below (when querying for `sensor_id`s in batches of 10):
```sql
SELECT
  (SELECT wind_speed     FROM sensor_data WHERE sensor_id = '1'  AND "time" > now() - '7 days'::interval AND wind_speed     IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_speed,
  (SELECT wind_direction FROM sensor_data WHERE sensor_id = '1'  AND "time" > now() - '7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_direction,
  (SELECT wind_speed     FROM sensor_data WHERE sensor_id = '2'  AND "time" > now() - '7 days'::interval AND wind_speed     IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_speed_two,
  (SELECT wind_direction FROM sensor_data WHERE sensor_id = '2'  AND "time" > now() - '7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_direction_two,
  -- ...
  (SELECT wind_speed     FROM sensor_data WHERE sensor_id = '10' AND "time" > now() - '7 days'::interval AND wind_speed     IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_speed_ten,
  (SELECT wind_direction FROM sensor_data WHERE sensor_id = '10' AND "time" > now() - '7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) AS wind_direction_ten;
```
The table I am querying against has 1,000 unique `sensor_id`s, all of which report data at a 2-minute interval. Hence, we are talking hundreds of millions of rows.
I’ve created an index on `(sensor_id, time DESC)` to further optimize the query. With the index, this query takes roughly 400 ms of planning time and 50 ms of execution time.
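That is, something along these lines (the index name is illustrative):

```sql
CREATE INDEX sensor_data_sensor_id_time_idx
ON sensor_data (sensor_id, "time" DESC);
```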
How can I write the query differently (or add indexes) to achieve optimal planning and execution time?
2 Answers
Unfortunately, Postgres does not (yet, as of pg 16) implement `IGNORE NULLS` for window functions. That would allow a simple call of `first_value()` for each value column.

Solutions
There are various shorter and possibly (much) faster options.
You should at least have a (partial) index on `(ts)`, possibly on `(sensor_id, ts)`, or more. See below. It all depends on undisclosed details.

I find the name "time" for a `timestamp` column misleading. Using "ts" instead.

first_value() + DISTINCT ON

A shorter drop-in replacement.
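A sketch of the idea (sorting each window with `<column> IS NULL` first pushes NULLs to the end, so `first_value()` returns the most recent non-null value; the `IN` list stands in for one batch of sensors):

```sql
SELECT DISTINCT ON (sensor_id)
       sensor_id
     , first_value(wind_speed)     OVER (PARTITION BY sensor_id
                                         ORDER BY wind_speed IS NULL, ts DESC) AS wind_speed
     , first_value(wind_direction) OVER (PARTITION BY sensor_id
                                         ORDER BY wind_direction IS NULL, ts DESC) AS wind_direction
FROM   sensor_data
WHERE  sensor_id IN ('1', '2', '10')       -- one batch of sensors
AND    ts > now() - interval '7 days';
```

Since the window values are constant within each partition, `DISTINCT ON (sensor_id)` can pick any row per sensor.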
count() window function in subquery + filtered aggregate in main query
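A sketch of that idea: `count()` skips NULLs, so a running count ordered by `ts DESC` stays at 1 from the most recent non-null row of a column until the next older non-null row; the filtered aggregate in the main query then picks exactly that value:

```sql
SELECT sensor_id
     , min(wind_speed)     FILTER (WHERE grp_ws = 1) AS wind_speed
     , min(wind_direction) FILTER (WHERE grp_wd = 1) AS wind_direction
FROM  (
   SELECT sensor_id, wind_speed, wind_direction
        , count(wind_speed)     OVER (PARTITION BY sensor_id ORDER BY ts DESC) AS grp_ws
        , count(wind_direction) OVER (PARTITION BY sensor_id ORDER BY ts DESC) AS grp_wd
   FROM   sensor_data
   WHERE  sensor_id IN ('1', '2', '10')    -- one batch of sensors
   AND    ts > now() - interval '7 days'
   ) sub
GROUP  BY sensor_id;
```

`min()` ignores the NULL rows that share group 1, so only the most recent non-null value survives.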
Simpler based on "sensor" table
If you also have a table "sensor" with one row per relevant `sensor_id` (like you probably should), it gets simpler:
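A sketch with `LEFT JOIN LATERAL`, assuming the `sensor` table's key column is `sensor_id`:

```sql
SELECT s.sensor_id, ws.wind_speed, wd.wind_direction
FROM   sensor s
LEFT   JOIN LATERAL (
   SELECT wind_speed
   FROM   sensor_data
   WHERE  sensor_id = s.sensor_id
   AND    ts > now() - interval '7 days'
   AND    wind_speed IS NOT NULL
   ORDER  BY ts DESC
   LIMIT  1
   ) ws ON true
LEFT   JOIN LATERAL (
   SELECT wind_direction
   FROM   sensor_data
   WHERE  sensor_id = s.sensor_id
   AND    ts > now() - interval '7 days'
   AND    wind_direction IS NOT NULL
   ORDER  BY ts DESC
   LIMIT  1
   ) wd ON true
WHERE  s.sensor_id IN ('1', '2', '10');    -- one batch of sensors
```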
The last query (like your verbose original) can use customized indexes. Ideally, partial indexes: there are many rows per sensor, few value columns, many NULL values, and many outdated rows.
Use a constant that’s one week in the past at creation time. The index grows in size over time, but stays applicable. Recreate indexes with a later cut-off from time to time to keep the size at bay. (Not sure whether the timestamp bound pays off for your hypertables, though. Plain indexes may be good enough. I had plain Postgres in mind.)
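A sketch of such partial indexes, one per value column (index names are illustrative; the cut-off must be a constant because functions in index predicates have to be immutable, which rules out `now()`):

```sql
CREATE INDEX sensor_data_ws_part_idx ON sensor_data (sensor_id, ts DESC)
INCLUDE (wind_speed)                            -- allows index-only scans
WHERE  wind_speed IS NOT NULL
AND    ts > '2023-12-11 00:00'::timestamptz;    -- constant cut-off, ~1 week back at creation time

CREATE INDEX sensor_data_wd_part_idx ON sensor_data (sensor_id, ts DESC)
INCLUDE (wind_direction)
WHERE  wind_direction IS NOT NULL
AND    ts > '2023-12-11 00:00'::timestamptz;
```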
Then run the same query, but with a constant timestamp:
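For example (a sketch; the constant must not pre-date the index cut-off, or the planner cannot use the partial index):

```sql
SELECT s.sensor_id, ws.wind_speed
FROM   sensor s
LEFT   JOIN LATERAL (
   SELECT wind_speed
   FROM   sensor_data
   WHERE  sensor_id = s.sensor_id
   AND    ts > '2023-12-18 00:00'::timestamptz  -- constant timestamp instead of now() - interval
   AND    wind_speed IS NOT NULL
   ORDER  BY ts DESC
   LIMIT  1
   ) ws ON true
WHERE  s.sensor_id IN ('1', '2', '10');
-- repeat the LATERAL join for each additional value column
```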
Sorted subquery + first() aggregate function

If index support is not an option or not efficient, the most convenient query uses the aggregate function `first()`. It is probably fastest, too, if you use the C version from the additional module `first_last_agg`.

Required once per DB:
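A sketch, assuming the `first_last_agg` module is available on the server:

```sql
CREATE EXTENSION first_last_agg;
```

The query then sorts in a subquery and lets `first()` pick the first non-null value per column:

```sql
SELECT sensor_id
     , first(wind_speed)     AS wind_speed
     , first(wind_direction) AS wind_direction
FROM  (
   SELECT sensor_id, wind_speed, wind_direction
   FROM   sensor_data
   WHERE  sensor_id IN ('1', '2', '10')    -- one batch of sensors
   AND    ts > now() - interval '7 days'
   ORDER  BY sensor_id, ts DESC            -- sorted subquery: newest first
   ) sub
GROUP  BY sensor_id;
```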
Expanding on the solution you chose from the amazing list given by @ErwinBrandstetter:
Because you’re using TimescaleDB, you don’t actually need the `first_last_agg` extension, because you already have a (slightly different) `first` aggregate. That query can actually be simplified down to:
Based on your feedback that your original planning time was 400 ms, I do wonder how many chunks your Timescale hypertable has. I think you could probably optimize here!
Another avenue for optimization is to compress this data. In a test, compression dropped the storage required for my data by 8x and sped up my query (the one above) by 3x.
I compressed, segmenting by `sensor_id` and ordering by `time DESC, wind_speed, wind_direction`, roughly as shown below.
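A sketch of those compression settings (the seven-day compression policy interval is an assumption):

```sql
ALTER TABLE sensor_data SET (
   timescaledb.compress,
   timescaledb.compress_segmentby = 'sensor_id',
   timescaledb.compress_orderby   = 'time DESC, wind_speed, wind_direction'
);

-- automatically compress chunks older than the given interval (assumed here)
SELECT add_compression_policy('sensor_data', INTERVAL '7 days');
```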