Description
So, I’m working on a project that stores sensor measurements from multiple devices in a PostgreSQL + TimescaleDB database.
The structure of the table (a hypertable):
| column_name | type | comment |
|---|---|---|
| identifier | text | device identifier |
| key | text | name of the metric |
| value_num | double precision | numeric measurement value |
| value_text | text | text measurement value |
| timestamp | timestamp with time zone | timestamp of the measurement |
The table has indexes on `(identifier, timestamp)` and `(identifier, key, timestamp)`.
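For reference, a minimal DDL sketch matching the structure above — the table name and the `create_hypertable` call are my assumptions, not taken from the actual schema:

```sql
-- Hypothetical schema matching the description; names are illustrative.
CREATE TABLE sensors (
    identifier  text NOT NULL,             -- device identifier
    key         text NOT NULL,             -- name of the metric
    value_num   double precision,          -- numeric measurement value
    value_text  text,                      -- text measurement value
    "timestamp" timestamptz NOT NULL       -- timestamp of the measurement
);

-- Turn it into a TimescaleDB hypertable partitioned on the time column.
SELECT create_hypertable('sensors', 'timestamp');

CREATE INDEX ON sensors (identifier, "timestamp" DESC);
CREATE INDEX ON sensors (identifier, key, "timestamp" DESC);
```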
Measurement value
A measurement stores its value in either the `value_num` or the `value_text` column, depending on the measurement type.
Metric types
Each device can have different metrics. For example, one device (`FOO`) might have:

- `temperature_air` (with `value_num`, as that metric has a numeric measurement)
- `current_program_identifier` (with `value_text`, as that metric has a text measurement)
and another device (`BAR`) might have:

- `temperature_water` (with `value_num`, as that metric has a numeric measurement)
- `water_level` (with `value_num`, as that metric has a numeric measurement)
- `current_program_identifier` (with `value_text`, as that metric has a text measurement)
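To make the layout concrete, here is how those metrics would land as rows in the single wide table (the table name and the measurement values are invented for illustration):

```sql
-- Illustrative rows only; values and table name are assumptions.
INSERT INTO sensors (identifier, key, value_num, value_text, "timestamp") VALUES
    ('FOO', 'temperature_air',            24.0,  NULL,   now()),
    ('FOO', 'current_program_identifier', NULL,  'H41S', now()),
    ('BAR', 'temperature_water',          32.05, NULL,   now()),
    ('BAR', 'water_level',                1.8,   NULL,   now()),
    ('BAR', 'current_program_identifier', NULL,  'W89G', now());
```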
Now I want to have a query, or, better yet, materialized view, that would show me the most recent measurements of all metrics grouped by device. Meaning, that I would expect to have something like:
| device | temperature_air | temperature_water | current_program_identifier |
|---|---|---|---|
| FOO | 24.0 | NULL | H41S |
| BAR | NULL | 32.05 | W89G |
Even better would be if the query could derive the column a measurement should go to, so the result could be reduced to:
| device | temperature | current_program_identifier |
|---|---|---|
| FOO | 24.0 | H41S |
| BAR | 32.05 | W89G |
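One way to sketch such a query: take the latest row per `(device, metric)` with `DISTINCT ON`, then pivot with `FILTER` aggregates. The table name `sensors` is an assumption, and the `COALESCE` is what merges the two temperature metrics into one derived column:

```sql
-- Sketch, not a tuned query: latest value per (device, metric),
-- pivoted into one row per device.
SELECT identifier AS device,
       COALESCE(
           max(value_num) FILTER (WHERE key = 'temperature_air'),
           max(value_num) FILTER (WHERE key = 'temperature_water')
       ) AS temperature,
       max(value_text) FILTER (WHERE key = 'current_program_identifier')
           AS current_program_identifier
FROM (
    -- newest row per (identifier, key)
    SELECT DISTINCT ON (identifier, key)
           identifier, key, value_num, value_text
    FROM sensors
    ORDER BY identifier, key, "timestamp" DESC
) latest
GROUP BY identifier;
```

Note that the inner `DISTINCT ON` scan is exactly the part that gets expensive on a large hypertable, which is why the requirements below matter.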
Requirements
The query needs to be fast, because:
- each device generates roughly 500k rows per day, so the dataset is quite big and grows fast;
- the query will be executed asynchronously from multiple client computers every few seconds.
Other thoughts
Database remodeling
I’ve thought about remodeling the database into something more normalized, but that appears to be a no-go: the collected metrics change constantly and we have no control over them, so we need a table structure that can store any metric. If you have any ideas for a better table structure, please share them.
Having a separate table
I’ve thought about simply storing the latest values of the metrics we care about in a separate table at ingestion time, but the data isn’t guaranteed to arrive in time order. That would add significant overhead: reading the current row, determining whether the received data is newer than what is already in the DB, and only then performing the insert into that separate table. So that was a no-go as well. Also, the metrics come in separate messages, and each message contains a timestamp only for that specific metric, so each metric column would have to be accompanied by its own timestamp column.
2 Answers
So, I've solved my problem by creating a `_log` table and adding a trigger to my main table which, on every insert, updates the `_log` table with the latest data. Now I have a `sensors` table, which contains all sensor readings from all devices, and a `sensors_log` table, which contains only the latest sensor readings for each device.

Basically, the approach is described in https://www.timescale.com/blog/select-the-most-recent-record-of-many-items-with-postgresql/ as Option 5.

It seems to be working quite well for the moment, but in the future I will dig into other methods for solving this and might update this answer if I find a more efficient one.
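A minimal sketch of that trigger setup, assuming the `sensors` / `sensors_log` names from above (the function and trigger names are my own). The `WHERE` clause on the upsert makes older, out-of-order messages lose to the row already stored:

```sql
-- "Latest value" table: one row per (device, metric).
CREATE TABLE sensors_log (
    identifier  text NOT NULL,
    key         text NOT NULL,
    value_num   double precision,
    value_text  text,
    "timestamp" timestamptz NOT NULL,
    PRIMARY KEY (identifier, key)
);

CREATE FUNCTION sensors_log_upsert() RETURNS trigger AS $$
BEGIN
    INSERT INTO sensors_log (identifier, key, value_num, value_text, "timestamp")
    VALUES (NEW.identifier, NEW.key, NEW.value_num, NEW.value_text, NEW."timestamp")
    ON CONFLICT (identifier, key) DO UPDATE
        SET value_num   = EXCLUDED.value_num,
            value_text  = EXCLUDED.value_text,
            "timestamp" = EXCLUDED."timestamp"
        -- only overwrite if the incoming measurement is newer
        WHERE sensors_log."timestamp" < EXCLUDED."timestamp";
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sensors_log_trigger
    AFTER INSERT ON sensors
    FOR EACH ROW EXECUTE FUNCTION sensors_log_upsert();
```

With this in place, the clients' "current values" query becomes a cheap read of `sensors_log` instead of a scan over the hypertable.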
Maybe the update overhead is not a real issue: since `sensors_log` keeps only a single row per device and metric, it is small enough to stay in the cache, so the trigger won't hit the disk on every insert.
Also, if you need a very flexible schema, I’d recommend you use Promscale, which would allow you to have a very flexible schema storing one metric per table. You can also use PromQL to fetch and join metrics in the same query. A significant advantage I see here is that you can have different retention policies for each metric, and that’s a great advantage because probably some of the metrics will be more important than others.
Through labels, you can also gain flexibility by attaching more data to a metric, in case you need to enhance some metrics with extra information.
Remote write allows you to just send the data, and Promscale will create the hypertables on the fly for you.