
Description

So, I’m working on a project that stores sensor measurements from multiple devices in a PostgreSQL + TimescaleDB database.

The structure of the table (hypertable):

| column_name | type                     | comment                      |
|-------------|--------------------------|------------------------------|
| identifier  | text                     | device identifier            |
| key         | text                     | name of the metric           |
| value_num   | double precision         | numeric measurement value    |
| value_text  | text                     | text measurement value       |
| timestamp   | timestamp with time zone | timestamp of the measurement |

The table has indexes on (identifier, timestamp) and (identifier, key, timestamp).
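For reference, a minimal DDL sketch of such a table (the table name `sensors` is an assumption; it matches the name used in the accepted answer below):

```sql
-- Hypothetical schema matching the structure described above
CREATE TABLE sensors (
    identifier  text NOT NULL,        -- device identifier
    key         text NOT NULL,        -- name of the metric
    value_num   double precision,     -- numeric measurement value
    value_text  text,                 -- text measurement value
    "timestamp" timestamptz NOT NULL  -- timestamp of the measurement
);

-- Turn it into a TimescaleDB hypertable partitioned by time
SELECT create_hypertable('sensors', 'timestamp');

CREATE INDEX ON sensors (identifier, "timestamp" DESC);
CREATE INDEX ON sensors (identifier, key, "timestamp" DESC);
```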

Measurement value
Each measurement stores its value in either the value_num or the value_text column, depending on the measurement type.

Metric types
Each device can have a different set of metrics. For example, one device (FOO) might have:

  • temperature_air (with value_num as that metric has numeric measurement)
  • current_program_identifier (with value_text as that metric has text measurement)

and another device (BAR) might have:

  • temperature_water (with value_num as that metric has numeric measurement)
  • water_level (with value_num as that metric has numeric measurement)
  • current_program_identifier (with value_text as that metric has text measurement)

Now I want a query, or, better yet, a materialized view, that shows me the most recent measurement of every metric, grouped by device. That is, I would expect something like:

| device | temperature_air | temperature_water | current_program_identifier |
|--------|-----------------|-------------------|----------------------------|
| FOO    | 24.0            | NULL              | H41S                       |
| BAR    | NULL            | 32.05             | W89G                       |
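One way to sketch such a pivot (assuming the `sensors` table above; adjust names to your schema) is to first pick the latest row per (device, metric) with DISTINCT ON, then pivot with conditional aggregation:

```sql
-- Latest value of each metric, pivoted into one row per device
SELECT identifier AS device,
       max(value_num)  FILTER (WHERE key = 'temperature_air')            AS temperature_air,
       max(value_num)  FILTER (WHERE key = 'temperature_water')          AS temperature_water,
       max(value_text) FILTER (WHERE key = 'current_program_identifier') AS current_program_identifier
FROM (
    -- DISTINCT ON keeps the first row per (identifier, key),
    -- which the ORDER BY makes the most recent one
    SELECT DISTINCT ON (identifier, key)
           identifier, key, value_num, value_text
    FROM sensors
    ORDER BY identifier, key, "timestamp" DESC
) latest
GROUP BY identifier;
```

Note that on a large hypertable the DISTINCT ON subquery can get expensive, which is why the accepted answer below ends up maintaining a separate latest-values table instead.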

Even better if the query could derive the output column for each measurement, so the result could be reduced to:

| device | temperature | current_program_identifier |
|--------|-------------|----------------------------|
| FOO    | 24.0        | H41S                       |
| BAR    | 32.05       | W89G                       |
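Merging several metric keys into one output column only changes the aggregate list of the pivot sketched earlier; the key-to-column mapping here (temperature_air and temperature_water both landing in temperature) is an assumption based on the example:

```sql
-- Same latest-per-(device, metric) subquery, but with metric keys
-- folded into shared output columns
SELECT identifier AS device,
       max(value_num)  FILTER (WHERE key IN ('temperature_air',
                                             'temperature_water'))       AS temperature,
       max(value_text) FILTER (WHERE key = 'current_program_identifier') AS current_program_identifier
FROM (
    SELECT DISTINCT ON (identifier, key)
           identifier, key, value_num, value_text
    FROM sensors
    ORDER BY identifier, key, "timestamp" DESC
) latest
GROUP BY identifier;
```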

Requirements

The query needs to be fast, because:

  • each device generates ~500k rows per day, so the dataset is large and grows quickly;
  • the query will be executed asynchronously from multiple client machines every few seconds.

Other thoughts

Database remodeling
I’ve thought about re-modeling the database into something more normalized, but that appears to be a no-go: the collected metrics change constantly and we have no control over them, so we need a table structure that can store any metric. If you have ideas for a better table structure, please share them with me.

Having a separate table
I’ve also thought about storing the latest values of the metrics we care about in a separate table at ingestion time. However, the data isn’t guaranteed to arrive in time order, so every insert would first have to read the current row, determine whether the incoming data is newer than what is already in the DB, and only then write to that separate table; that overhead made it a no-go. Also, the metrics come in separate messages, and each message contains a timestamp only for that specific metric, so every metric column would have to be accompanied by a timestamp column.


Answers


  1. Chosen as BEST ANSWER

    So, I've solved my problem by creating a _log table and adding a trigger to my main table which, on every insert, updates the _log table with the latest data.

    Now I have a sensors table, which contains all sensor readings from all devices, and a sensors_log table, which contains only the latest sensor readings for each device.

    Basically, the approach is described as Option 5 in https://www.timescale.com/blog/select-the-most-recent-record-of-many-items-with-postgresql/.

    It seems to be working quite well for the moment, but I will dig into other methods in the future and might update this answer if I find a more efficient solution.
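A sketch of that trigger approach, assuming the table names above (sensors and sensors_log); the details in the linked article's Option 5 may differ:

```sql
-- One row per (device, metric), holding only the latest reading
CREATE TABLE sensors_log (
    identifier  text NOT NULL,
    key         text NOT NULL,
    value_num   double precision,
    value_text  text,
    "timestamp" timestamptz NOT NULL,
    PRIMARY KEY (identifier, key)
);

CREATE OR REPLACE FUNCTION update_sensors_log() RETURNS trigger AS $$
BEGIN
    INSERT INTO sensors_log (identifier, key, value_num, value_text, "timestamp")
    VALUES (NEW.identifier, NEW.key, NEW.value_num, NEW.value_text, NEW."timestamp")
    ON CONFLICT (identifier, key) DO UPDATE
        SET value_num   = EXCLUDED.value_num,
            value_text  = EXCLUDED.value_text,
            "timestamp" = EXCLUDED."timestamp"
        -- Guard against out-of-order arrival: only overwrite with newer data
        WHERE sensors_log."timestamp" < EXCLUDED."timestamp";
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sensors_latest
    AFTER INSERT ON sensors
    FOR EACH ROW EXECUTE FUNCTION update_sensors_log();
```

The ON CONFLICT ... WHERE clause also addresses the out-of-order concern from the question: older readings simply don't overwrite newer ones.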


  2. > I’ve thought that I could simply store latest values of metrics that are interesting for us to the separate table at the ingestion time, but the data isn’t guaranteed to come in correct time order, so that would add a big overhead of reading current data, determining if the data received is newer than the one that is already in the DB and only then performing the insert to that separate table. So that was a no-go. Also, the metrics comes in separate messages and the message contains timestamp only for that specific metric, so each metric column would have to be accompanied by the timestamp column.

    Maybe that is not a real issue, because if you keep only a single latest record per key in the table, that row will almost always be in the cache, so you won't be reading from disk on every insert.

    Also, if you need a very flexible schema, I’d recommend Promscale, which stores one metric per table. You can also use PromQL to fetch and join metrics in the same query. A significant advantage here is that you can have a different retention policy for each metric, which is great because some of the metrics will probably be more important than others.

    Labels also give you the flexibility to attach more data to a metric, in case you need to enrich some metrics with extra information.

    Remote write lets you just send the data, and the hypertables are created on the fly for you.
