
I have the following table. It is stored as a TimescaleDB hypertable. Data rate is 1 row per second.

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY ("time", meter_id)
)

I would like to get the latest row in each time bucket over a longer period, for instance the latest record of each month for the previous year.
The following query works but is slow:

EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC
Unique  (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
  ->  Custom Scan (ChunkAppend) on electricity_data  (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
        Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
        ->  Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk  (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
              Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
              Rows Removed by Filter: 24330
        ->  Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk  (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
              Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
  Functions: 217
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms

Getting the latest row for a single month is instantaneous:

EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1
Limit  (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
  ->  Custom Scan (ChunkAppend) on electricity_data  (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
        Order: electricity_data."time" DESC
        ->  Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk  (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
              Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
        ->  Index Scan Backward using "1529_1849_electricity_data_pkey" on _hyper_12_1529_chunk  (cost=0.42..25777.81 rows=604553 width=40) (never executed)
              Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
        ->  Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk  (cost=0.42..8.94 rows=174 width=40) (never executed)
              Index Cond: (("time" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND ("time" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms

Is there a way to execute the query above for each month or custom time interval? Or is there a different way to speed up the first query?

Edit

The following query takes 10 seconds, which is much better, but still far slower than running the single-month query above once per bucket. Adding an index does not seem to make a difference.

EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
    AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
    AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
(... plan removed)
Planning Time: 50.463 ms
JIT:
  Functions: 451
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms

3 Answers


  1. Chosen as BEST ANSWER

    The other answers are likely more useful in most cases. I wanted a solution that works for any interval, without the need for continuous aggregates.

    I ended up with the following query, using a lateral join. I use the lag function to compute energy consumption/generation in a time bucket (omitted below). Variables $__interval, $__timeFrom() and $__timeTo() specify the chosen bucket interval and time range.

    SELECT bucket, import_low, import_normal, export_low, export_normal
    FROM (
      SELECT
      tstzrange(
        -- Could also use date_trunc or date_bin
        time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
        time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
        '(]'  -- We use an inclusive upper bound, because a meter reading on the upper boundary applies to the previous period
      ) bucket
      FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
    ) buckets
    LEFT JOIN LATERAL (
      SELECT *
      FROM electricity_data
      WHERE meter_id = $meterId AND "time" <@ bucket
      ORDER BY "time" DESC
      LIMIT 1
    ) elec ON true
    ORDER BY bucket;
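
    Roughly, the lag step I omitted could look like the sketch below, wrapping the bucketed query above in a CTE (the _delta column names are only illustrative):

    WITH latest_per_bucket AS (
      SELECT bucket, import_low, import_normal, export_low, export_normal
      FROM (
        SELECT
        tstzrange(
          time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
          time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
          '(]'
        ) bucket
        FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
      ) buckets
      LEFT JOIN LATERAL (
        SELECT *
        FROM electricity_data
        WHERE meter_id = $meterId AND "time" <@ bucket
        ORDER BY "time" DESC
        LIMIT 1
      ) elec ON true
    )
    SELECT
      bucket,
      -- Assuming the readings are cumulative meter registers, the usage in a bucket
      -- is this bucket's latest reading minus the previous bucket's latest reading.
      import_low    - lag(import_low)    OVER (ORDER BY bucket) AS import_low_delta,
      import_normal - lag(import_normal) OVER (ORDER BY bucket) AS import_normal_delta,
      export_low    - lag(export_low)    OVER (ORDER BY bucket) AS export_low_delta,
      export_normal - lag(export_normal) OVER (ORDER BY bucket) AS export_normal_delta
    FROM latest_per_bucket
    ORDER BY bucket;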
    

  2. You can try an approach that uses a subquery to get the timestamp of the latest time in each bucket. Then, join that to your detail table.

            SELECT meter_id, MAX("time") "time"
              FROM electricity_data
              WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
                AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
              GROUP BY meter_id, 
                       time_bucket('1 month', "time", 'Europe/Amsterdam')
    

    That gets you a virtual table with the latest time for each meter for each time bucket (month in this case). It can be accelerated with this index, the same as your primary key but with the columns in the opposite order. With the columns in that order the query can be satisfied with a relatively quick index scan.

    CREATE INDEX meter_time ON electricity_data (meter_id, "time")
    

    Then join that to your detail table. Like this.

    SELECT d.meter_id,
           time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
           d."time",
           d.import_low,
           d.import_normal,
           d.export_low,
           d.export_normal
      FROM electricity_data d
      JOIN (
            SELECT meter_id, MAX("time") "time"
              FROM electricity_data
              WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
                AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
              GROUP BY meter_id, 
                       time_bucket('1 month', "time", 'Europe/Amsterdam')
           ) last ON d."time" = last."time" 
                 AND d.meter_id = last.meter_id
     ORDER BY d.meter_id, bucket DESC
    

    (I’m not completely sure of the syntax in TimescaleDB for columns that have the same name as reserved words like time, so this isn’t tested.)

    If you want just one meter, put a WHERE clause right before the last ORDER BY clause.
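
    For example, to restrict the result to the single meter from the question (meter 1), the end of the query above would become:

     WHERE d.meter_id = 1
     ORDER BY d.meter_id, bucket DESC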

  3. I’d recommend using the last aggregate and a continuous aggregate to solve this problem.

    Like the previous poster, I’d also recommend an index on (meter_id, "time") rather than the other way around; you can do this in your table definition by just changing the order of the columns in your primary key definition.

    CREATE TABLE electricity_data
    (
        "time" timestamptz NOT NULL,
        meter_id integer REFERENCES meters NOT NULL,
        import_low double precision,
        import_normal double precision,
        export_low double precision,
        export_normal double precision,
        PRIMARY KEY (meter_id, "time")
    );
    

    But that’s a bit off topic. The basic query you’ll want to do is something like:

    SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'), 
        meter_id, 
        last(electricity_data, "time") 
    FROM electricity_data 
    GROUP BY 1, 2;
    

    This is a bit confusing until you realize that the table itself is also a type in PostgreSQL – so you can ask for and return a composite type from this call to the last aggregate, which will get the latest value in the month or day or whatever you want.

    Then you have to be able to treat that as a row again, so you can expand that by using parentheses and a .* which is how composite types can be expanded in PG.

    SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
        meter_id, 
        (last(electricity_data, "time")).*
    FROM electricity_data 
    GROUP BY 1,2;
    

    Now, in order to speed things up, you can turn that into a continuous aggregate which will make things much faster.

    CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
        (last(electricity_data, "time")).*
    FROM electricity_data 
    GROUP BY 1, meter_id;
    

    You’ll note that I took meter_id out of the initial select list because it’s going to come from our composite type and I don’t need the redundant column (nor can I have two columns with the same name in a view), but I did keep meter_id in my group by.

    So that’ll speed things up nicely, but, if I were you, I might actually think about doing this on a daily basis and creating a hierarchical continuous aggregate for this type of thing.

    CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
        (last(electricity_data, "time")).*
    FROM electricity_data 
    GROUP BY 1, meter_id;
    
    CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 month', time_bucket, 'Europe/Amsterdam') as month_bucket,
        (last(last_meter_day, time_bucket)).*
    FROM last_meter_day 
    GROUP BY 1, meter_id;
    

    The reason for that is that we can’t really refresh a monthly continuous aggregate all that often; it’s much easier to refresh a daily aggregate and then roll that up into a monthly aggregate more frequently. You could also keep just the daily aggregate and roll up to month on the fly in your query, since that would be at most around 31 daily rows per meter per month, but of course that won’t be as performant.

    You’ll then have to create continuous aggregate policies for these based on what you want to have happen on refresh.
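
    For example, something along these lines (the offsets and schedule intervals here are just assumptions to adjust to how fresh you need the data to be):

    -- Assumed refresh windows; tune start_offset, end_offset and schedule_interval.
    SELECT add_continuous_aggregate_policy('last_meter_day',
        start_offset => INTERVAL '3 days',
        end_offset => INTERVAL '1 hour',
        schedule_interval => INTERVAL '1 hour');

    SELECT add_continuous_aggregate_policy('last_meter_month',
        start_offset => INTERVAL '3 months',
        end_offset => INTERVAL '1 day',
        schedule_interval => INTERVAL '1 day');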

    I’d also suggest, depending on what you’re trying to do with this, that you might want to take a look at counter_agg as it might be useful for you. I also recently wrote a post in our forum about how to use it with electricity meters that might be helpful for you depending on how you’re processing this data.
