
I have a table containing weather information for various locations. Each location is identified by an identifier station_id, and each kind of information is identified by a string (e.g. wind_speed, humidity,…).

Here is a simplified example. first_time and last_time are two timestamps, I will explain later what they are for.

first_time  last_time  station_id  name        value
----------  ---------  ----------  ----------  -----
10:00       10:10      1           wind_speed  12
10:15       10:40      1           wind_speed  9
10:20       10:25      1           humidity    43
10:25       10:40      2           wind_speed  2
10:45       10:50      1           wind_speed  10

Each pair (station_id, name) defines a time series. If we focus on the pair (1, wind_speed), we see that it is measured every 5 minutes. At 10:00, 10:05, and 10:10 it was measured at 12 m/s (units are implicit); from 10:15 to 10:40 it dropped to 9 m/s; and at 10:45 and 10:50 the measurement read 10.

The fact that not every measurement results in a new record is an optimization that keeps the table dramatically smaller than it would otherwise be (some of these variables change very rarely or never at all). The two timestamps identify the first measurement at which the property had a given value and the last measurement at which it still did.

When we want to insert a new measurement result into the table, we look at the latest record with the same station ID and property name. If its value differs, we add a new record; otherwise we update the last_time of the existing record. We call this operation a "merge", even though it differs from the SQL MERGE statement.

Here is the question: how do we perform this operation safely and efficiently? We currently perform two queries (a SELECT and an UPDATE or an INSERT). The operation we want to perform is similar to UPSERT, but not quite the same. The (station_id, name) index does not have any unique constraint. Is there any way to perform this operation in a single query in PostgreSQL?
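For concreteness, the current two-statement approach looks roughly like this (a sketch; the if/else flow actually lives in application code and is written here as comments, and the parameter placeholders are ours):

    -- step 1: fetch the latest value for this series
    select value from measurements
    where station_id = :station_id and name = :name
    order by last_time desc limit 1;

    -- step 2a: if it equals the new value, extend the current run
    update measurements set last_time = :ts
    where station_id = :station_id and name = :name
      and last_time = (select max(last_time) from measurements
                       where station_id = :station_id and name = :name);

    -- step 2b: otherwise start a new run
    insert into measurements (first_time, last_time, station_id, name, value)
    values (:ts, :ts, :station_id, :name, :value);

Between step 1 and step 2 another session can act on the same series, which is part of why a single-statement form is attractive.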

2 Answers


    1. In the MERGE join condition suggested by @Adrian Klaver, you can add a subquery that only looks for matches against the latest record for a given station_id and name: demo at db<>fiddle

      merge into measurements current
      using (values ('today 10:56'::timestamp,'today 10:56'::timestamp,1,'wind_speed',10)
                   ,('today 10:56'::timestamp,'today 10:56'::timestamp,1,'humidity',46) )
      as incoming(first_time,last_time,station_id,name_,value_)
      on (  current.value_ = incoming.value_
        -- the subquery pins the match to the single latest row of the
        -- series, regardless of its value; filtering the subquery on
        -- value_ as well could wrongly extend an older, superseded run
        and current.ctid=(select ctid from measurements as m2
                          where (m2.station_id,m2.name_)
                               =(incoming.station_id,incoming.name_)
                          order by last_time desc limit 1) )
      when matched then update set
        last_time=incoming.last_time
      when not matched then insert
        values (incoming.first_time,incoming.last_time,incoming.station_id,incoming.name_,incoming.value_);
      

      That wind_speed is the same as the latest, so it gets merged. The humidity changed, so it’s added as a separate entry. I’m using the built-in column ctid to uniquely identify the latest row.
      It helps to index and possibly cluster the table – you definitely don’t want to seq scan for the latest data on each insert.
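      A covering index along these lines supports the latest-row subquery (a sketch; the index name is made up):

      create index measurements_series_latest_idx
        on measurements (station_id, name_, last_time desc);

      -- optionally rewrite the table in index order
      -- (takes an exclusive lock while it runs):
      cluster measurements using measurements_series_latest_idx;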

    2. Split the table in two: one for the already "compressed" history data and a separate one for current measurements, holding a unique row for each (station_id, name). This lets you run insert ... on conflict do update against the table with current data, since that statement requires the unique constraint. When you get the same value again, it just bumps up the last_time in current_measurements. A trigger can detect when a new, different value appears and move the old row to the history table, so that the new one takes its place as the current one. To look at all measurements, you can set up a view that connects them with a union all:

      create view all_measurements as 
        table history_measurements union all
        table current_measurements;
      

      This setup simplifies the search for the latest entry – it’s always in the small, easy-to-navigate table with all unique entries. The only index that’s strictly required is the tiny unique index on current_measurements that drives the upsert.
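      The upsert and the archiving trigger could be sketched roughly as below (the column, function, and trigger names are made up; the trigger follows the standard PL/pgSQL OLD/NEW pattern):

      -- upsert: always lands on the unique (station_id, name_) row
      insert into current_measurements (first_time, last_time, station_id, name_, value_)
      values (:ts, :ts, :station_id, :name, :value)
      on conflict (station_id, name_) do update
        set last_time = excluded.last_time,
            value_    = excluded.value_;

      -- trigger: when the value changes, archive the old run
      -- and start a new one
      create function archive_changed_value() returns trigger as $$
      begin
        if new.value_ is distinct from old.value_ then
          insert into history_measurements select old.*;
          new.first_time := new.last_time;  -- the new run starts now
        end if;
        return new;
      end $$ language plpgsql;

      create trigger archive_on_change
        before update on current_measurements
        for each row execute function archive_changed_value();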

  1. I would try to make it a bit simpler. Assuming that last_time can be computed later, at query time, with the lead window function, you need not update or merge the latest record for (station_id, name) when the value is the same; you can simply skip the event instead. Here is a parameterized query:

    insert into readings (first_time, station_id, "name", "value")
    select :first_time, :station_id, :name, :value
    where (
      -- latest stored value for this series
      select "value" from readings
      where station_id = :station_id
        and "name" = :name
      order by first_time desc limit 1
     ) is distinct from :value;
    -- "is distinct from" (rather than <>) also lets the very first
    -- reading of a series through, when the subquery returns null
    

    I would also suggest that you remove the column last_time from the table.
    You will also need a suitable index on (station_id, name, first_time).
    A fiddle to experiment with
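    Reconstructing last_time on read could then look like this (a sketch; it assumes the fixed 5-minute sampling interval from the example, so a run ends 5 minutes before the next run starts):

    select station_id, "name", "value", first_time,
           lead(first_time) over (partition by station_id, "name"
                                  order by first_time)
             - interval '5 minutes' as last_time
    from readings;

    The last row of each series gets a null last_time, meaning the value is still current.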
