skip to Main Content

I’ve been stumped trying to optimize this query and was hoping some of you database wizards might have some insight. Here is the setup.

Using TimescaleDB as my database, I have a wide table containing sensor data, it looks like the below:

time sensor_id wind_speed wind_direction
‘2023-12-18 12:15:00’ ‘1’ NULL 176
‘2023-12-18 12:13:00’ ‘1’ 4 177
‘2023-12-18 12:11:00’ ‘1’ 3 NULL
‘2023-12-18 12:09:00’ ‘1’ 8 179

I want to write a query which gives me the most recent non-null value for a set of columns, filtered on sensor_id. For the above data (filtering on sensor_id 1), this query should return

wind_speed wind_direction
4 176

With that being said, my query looks like the below (when querying for sensor_ids in batches of 10):

SELECT
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction,

    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_two,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_two,
    .
    .
    .
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_ten,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_ten;

The table I am querying against has 1,000 unique sensor_ids, all of which report data at a 2 minute interval. Hence, we are talking 100s of millions of rows.

I’ve created an index on (sensor_id, time DESC) to further optimize the query. With the index, this query is taking roughly 400ms and 50ms planning and execution time respectively.

How can I write the query differently (or add indexes) to achieve optimal planning and execution time?

2

Answers


  1. Unfortunately, Postgres does not (yet, as of pg 16) implement IGNORE NULLS for window functions. That would allow a simple call of first_value() for each value column. See:

    Solutions

    fiddle

    There are various shorter and possibly (much) faster options.
    You should at least have a (partial) index on (ts). Possibly on (sensor_id, ts). Or more. See below. All depending on undisclosed details.

    I find the name "time" for a timestamp column misleading. Using "ts" instead.

    first_value() + DISTINCT ON

    A shorter drop-in replacement.

    SELECT DISTINCT ON (sensor_id)
           sensor_id
         , first_value(wind_speed    ) OVER (w ORDER BY wind_speed     IS NULL, ts DESC) AS wind_speed
         , first_value(wind_direction) OVER (w ORDER BY wind_direction IS NULL, ts DESC) AS wind_direction
    --   , ... more?
    FROM   sensor_data
    WHERE  ts > LOCALTIMESTAMP - interval '7 days'
    WINDOW w AS (PARTITION BY sensor_id);
    

    About DISTINCT ON:

    count() window function in subquery + filtered aggregate in main

    SELECT sensor_id
         , min(wind_speed)     FILTER (WHERE ws_ct = 1) AS wind_speed
         , min(wind_direction) FILTER (WHERE wd_ct = 1) AS wind_direction
    --   , ... more?
    FROM  (
       SELECT *
            , count(wind_speed)     OVER w AS ws_ct
            , count(wind_direction) OVER w AS wd_ct
       --   ,  ... more?
       FROM   sensor_data
       WHERE  ts > LOCALTIMESTAMP - interval '7 days'
       WINDOW w AS (PARTITION BY sensor_id ORDER BY ts DESC)
       ) sub
    GROUP  BY sensor_id;
    

    See:

    Simpler based on "sensor" table

    If you also have a table "sensor" with one row per relevant sensor_id (like you probably should), it gets simpler:

    SELECT sensor_id
        , (SELECT wind_speed     FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_speed     IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_speed
        , (SELECT wind_direction FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_direction IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_direction
    --  , ... more?
    FROM   sensor s
        , (SELECT LOCALTIMESTAMP - interval '7 days') t(ts_min)
    ;
    

    The last query (like your verbose original) can use customized indexes. Ideally, partial indexes – while there are many rows per sensor, few value columns, many null values and many outdated rows.

    CREATE INDEX sensor_data_wind_speed_idx     ON sensor_data (sensor_id, ts DESC, wind_speed)
    WHERE  wind_speed IS NOT NULL
    AND    ts > '2023-12-12 00:00';  -- constant!
    
    CREATE INDEX sensor_data_wind_direction_idx ON sensor_data (sensor_id, ts DESC, wind_direction)
    WHERE  wind_direction IS NOT NULL
    AND    ts > '2023-12-12 00:00';  -- constant!
    

    Use a constant that’s one week in the past at creation time. The index grows in size over time, but stays applicable. Recreate indexes with later cut-off from time to time to keep the size at bay. (Not sure if the timestamp bound pays for your hypertables, though. Plain indexes may be good enough. I had plain Postgres in mind.)

    Then run the same query, but with a constant timestamp:

    SELECT ...
    FROM   sensor s
        , (SELECT timestamp '2023-12-12 03:47:16') t(ts_min)  -- MUST be a constant to use partial index!
    ;
    

    Sorted subquery + first() aggregate function

    If index-support is not an option or not efficient, the most convenient query would be with the aggregate function first() – probably fastest, too, if you use the C version from the additional module first_last_agg. See:

    Required once per DB:

    CREATE EXTENSION first_last_agg;
    
    SELECT sensor_id
         , first(wind_speed    ) FILTER (WHERE wind_speed IS NOT NULL)     AS wind_speed
         , first(wind_direction) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
    --   , ... more?
    FROM   (
       SELECT * FROM sensor_data
       WHERE  ts > LOCALTIMESTAMP - interval '7 days'
       ORDER  BY sensor_id, ts DESC
       ) s
    GROUP  BY 1;
    
    Login or Signup to reply.
  2. Expanding the solution you chose from the amazing list given by @ErwinBrandstetter

    Because you’re using TimescaleDB you don’t actually need the first_last_agg extension because you already have a (slightly different) first agg.

    That query can actually be simplifed down to:

    SELECT sensor_id,
          last(wind_speed,ts) FILTER (WHERE wind_speed IS NOT NULL) AS wind_speed
        , last(wind_direction,ts) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
    FROM  sensor_data
    WHERE  ts > LOCALTIMESTAMP - interval '7 days'
    GROUP BY 1;
    

    Based on your feedback about your original planning time being 400ms I do wonder how many chunks your Timescale hypertable has? I think you could probably optimize here!

    Another avenue for optimization is to compress this data. When I did a test I dropped my storage required for my data by 8x my query speed (for the query above) by 3x.

    I compressed segmenting by sensor_id and ordering by time DESC, wind_speed, wind_direction.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search