I have a table containing weather information for various locations. Each location is identified by an identifier `station_id`, and each kind of information is identified by a string (e.g. `wind_speed`, `humidity`, …).
Here is a simplified example. `first_time` and `last_time` are two timestamps; I will explain later what they are for.
| first_time | last_time | station_id | name | value |
|---|---|---|---|---|
| 10:00 | 10:10 | 1 | wind_speed | 12 |
| 10:15 | 10:40 | 1 | wind_speed | 9 |
| 10:20 | 10:25 | 1 | humidity | 43 |
| 10:25 | 10:40 | 2 | wind_speed | 2 |
| 10:45 | 10:50 | 1 | wind_speed | 10 |
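For concreteness, the table in the example might be declared as follows. This is a sketch: the table name `measurements` and the column types are my assumptions, since the post does not give any DDL.

```sql
-- Hypothetical DDL matching the example; names and types are assumed.
CREATE TABLE measurements (
    first_time timestamptz NOT NULL,  -- first measurement with this value
    last_time  timestamptz NOT NULL,  -- most recent measurement with this value
    station_id integer     NOT NULL,
    name       text        NOT NULL,  -- e.g. 'wind_speed', 'humidity'
    value      numeric     NOT NULL
);

-- Non-unique index on the series key, as described later in the question.
CREATE INDEX measurements_series_idx ON measurements (station_id, name);
```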
Each pair `(station_id, name)` defines a time series. If we focus on the pair `(1, wind_speed)`, we see that it is measured every 5 minutes. At 10:00, 10:05, and 10:10 it was measured at 12 m/s (units are implicit); from 10:15 to 10:40 it dropped to 9 m/s; then at 10:45 and 10:50 the measurement read 10.
The fact that not every measurement results in a new record is an optimization that keeps the table dramatically smaller than it would otherwise be (some of these variables change very rarely, or never at all). The two timestamps identify the first measurement at which the property had a given value and the last measurement at which it still did.
When we want to insert a new measurement into the table, we look at the latest record with the same station ID and property name. If its value differs, we add a new record; otherwise we update the `last_time` of the existing record. We call this operation a "merge", even though it differs from the SQL MERGE statement.
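Sketched as two statements — the table name `measurements` and the `$n` parameters are my assumptions (`$1` = station_id, `$2` = name, `$3` = measurement timestamp, `$4` = measured value):

```sql
-- 1) Look up the latest value for this series.
SELECT value
FROM measurements
WHERE station_id = $1 AND name = $2
ORDER BY last_time DESC
LIMIT 1;

-- 2a) If it equals $4, extend the existing record ...
UPDATE measurements
SET last_time = $3
WHERE station_id = $1
  AND name = $2
  AND last_time = (SELECT max(last_time) FROM measurements
                   WHERE station_id = $1 AND name = $2);

-- 2b) ... otherwise open a new record.
INSERT INTO measurements (first_time, last_time, station_id, name, value)
VALUES ($3, $3, $1, $2, $4);
```

Running these as separate statements leaves a window for a concurrent writer between the SELECT and the subsequent write.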
Here is the question: how do we perform this operation safely and efficiently? We currently run two queries (a SELECT, then an UPDATE or an INSERT). The operation we want to perform is similar to UPSERT, but not quite the same: the `(station_id, name)` index does not have a unique constraint. Is there any way to perform this operation in a single query in PostgreSQL?
2 Answers
In a MERGE join condition, as suggested by @Adrian Klaver, you can add a subquery to only look for matches against the latest record for a given `station_id` and `name`: demo at db<>fiddle.

That `wind_speed` is the same as the latest, so it gets merged. The `humidity` changed, so it's added as a separate entry. I'm using the built-in column `ctid` to uniquely identify the latest row.

It helps to `index` and possibly `cluster`
the table – you definitely don't want to seq scan for the latest data on each insert.

Split the table in two: one for your already "compressed" history data and a separate one for current measurements, holding a unique row for each `station_id` and measurement `name`. This lets you `insert ... on conflict do update` against the one with current data, since it requires the constraint. When you get the same `value` again, it just bumps up the `last_time` in `current_measurements`. A trigger can see when a new, different `value` appears and move the old one to the history table, so that the new one takes its place as the current one. To look at all measurements, you can set up a view that connects them with a `union all`:
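Such a view might be sketched as follows; `current_measurements` is named in the answer, while `measurement_history` is a name of my choosing for the history table:

```sql
-- Combine "compressed" history rows with the one current row per series.
CREATE VIEW all_measurements AS
SELECT first_time, last_time, station_id, name, value
FROM measurement_history
UNION ALL
SELECT first_time, last_time, station_id, name, value
FROM current_measurements;
```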
This setup simplifies the search for the latest entry – it's always in the small, easy-to-navigate table with all unique entries. The only index that's strictly required is the tiny `unique` index on `current_measurements`
that guides the upsert.

I would try to make it a bit simpler. Assuming that `last_time` can be calculated later in a query using the `lead` window function, you need not update or merge the latest record for `(first_time, station_id, name)` if the value is the same, but can skip the event instead. Here is a parameterized query:

I would also suggest that you remove the column `last_time` from the table. You will need the proper index `(station_id, name, first_time)` too.
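A conditional insert along those lines might be sketched as follows (the table name `measurements` and the parameter numbering are my assumptions; `$1` = first_time, `$2` = station_id, `$3` = name, `$4` = value):

```sql
-- Insert the event only if there is no latest record for this series
-- or its stored value differs from the new one.
INSERT INTO measurements (first_time, station_id, name, value)
SELECT $1, $2, $3, $4
WHERE $4 IS DISTINCT FROM (
    SELECT value
    FROM measurements
    WHERE station_id = $2 AND name = $3
    ORDER BY first_time DESC
    LIMIT 1
);
```

`IS DISTINCT FROM` makes the comparison also succeed when the subquery finds no row at all, i.e. for the first event of a series; the suggested `(station_id, name, first_time)` index supports the `ORDER BY ... LIMIT 1` lookup.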
A fiddle to experiment with