I’m seeking advice on how to optimize my time-series database setup, which needs to handle a large volume of data. I have around 20,000 time-series profiles, each spanning one year at quarter-hourly (15-minute) resolution, i.e. 4 timestamps per hour. This amounts to approximately 700 million entries. Right now, I am using an Azure PostgreSQL server with the TimescaleDB extension.
Here are the details of my setup:
Hardware Specifications:
4 vCores
16 GiB memory
512 GB storage
Database Structure:
I have two tables: one for the load profiles with the columns (id, time, value, sensor_id), and a sensor table with the columns (id, sensor_id). There are two indexes on the load profile table: one on (sensor_id, time), and another on sensor_id.
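For context, here is a simplified sketch of the schema (the column types and the hypertable chunk interval shown are illustrative, not necessarily what I actually use):

-- Simplified schema sketch; column types and chunk interval are illustrative
CREATE TABLE public.sensor_table (
    id        BIGINT PRIMARY KEY,
    sensor_id TEXT
);

CREATE TABLE public.loadprofilepool (
    id        BIGSERIAL,
    time      TIMESTAMPTZ NOT NULL,
    value     DOUBLE PRECISION,
    sensor_id BIGINT
);

-- loadprofilepool is a TimescaleDB hypertable partitioned on time
SELECT create_hypertable('public.loadprofilepool', 'time',
                         chunk_time_interval => INTERVAL '1 week');

-- Existing indexes on the load profile table
CREATE INDEX ON public.loadprofilepool (sensor_id, time);
CREATE INDEX ON public.loadprofilepool (sensor_id);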
Sample Query:
A typical query I use to aggregate data is:
SELECT AVG(value), time
FROM public.loadprofilepool
WHERE sensor_id IN (
    SELECT id
    FROM public.sensor_table
    ORDER BY RANDOM()
    LIMIT 500
)
GROUP BY time;
Please note that this is a sample query where the list of sensor_ids is generated on the fly to mimic retrieval of different sets of sensors. In a real situation, the list of ids would come from elsewhere and could contain from a few to a couple of thousand sensor ids.
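For illustration, a real call with an explicit id list supplied by the application would look roughly like this (the sensor ids below are made-up placeholders):

-- Aggregation with an explicit list of sensor ids passed in by the application
SELECT AVG(value), time
FROM public.loadprofilepool
WHERE sensor_id = ANY (ARRAY[101, 102, 103, 2045])  -- placeholder ids
GROUP BY time
ORDER BY time;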
Data Distribution:
For now, there are 24 * 4 * 365 = 35,040 rows per sensor (one year at quarter-hourly resolution), and there are 20,000 sensors. In the future, there will also be live sensor data, whose distribution will depend on the specific sensor.
Performance Metrics:
When running these queries, the CPU usage does not exceed 20% and memory usage is constant at about 40%.
Given these details, I’m struggling with query speed. Extracting 10 to 1,000 profiles and summing them up to generate a time series for each timestamp currently takes about 5 seconds to several minutes, whereas my target is a few seconds for an aggregation of a couple of thousand sensors.
My questions are as follows:
- Is my current setup the most efficient for handling and querying this volume and type of time-series data? If not, could you suggest alternative methods? I’ve considered NoSQL databases and cloud storage with Zarr or NetCDF files, but I’m not sure which, if any, would be more suitable.
- How can I optimize my current setup to achieve faster query results? Are there specific TimescaleDB or PostgreSQL configurations, indexing strategies, or query formulation tactics that would help improve performance?
Thank you in advance for your help. Any suggestions or guidance would be greatly appreciated.
Best regards,
Hannes
I have tried to create different indexes and cluster the loadprofilepool table.
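Roughly, those attempts looked like this (the statements below are representative, not the exact definitions I used):

-- Representative index/cluster attempts; exact definitions varied between runs
CREATE INDEX loadprofilepool_sensor_time_idx
    ON public.loadprofilepool (sensor_id, time DESC);

CLUSTER public.loadprofilepool USING loadprofilepool_sensor_time_idx;
ANALYZE public.loadprofilepool;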
2 Answers
I would encourage you to create a Proof of Concept and try different options.
You can use any of the options you already mentioned (TimescaleDB, a NoSQL database, or Zarr/NetCDF files in cloud storage), among others.
Which one to choose? It depends on cost, performance, your team’s familiarity with those products…
There are some guiding principles for the setup of TimescaleDB. Below is a subset of the list:
- Define the objective of the project: is it read-optimized, write-optimized (faster ingest), or storage-optimized?
- Identify the right chunk interval size for your use case. For example, a smaller chunk interval makes ingest faster and keeps storage use optimal (since retention runs at the chunk level), but read queries spanning long durations will slow down (see the SQL sketch after this list).
- Try to have the active chunks occupy 25% or less of RAM.
- Enable compression for storage optimization without data loss.
- Start using continuous aggregates, which avoid re-aggregating the raw data every time you query it.
- Most importantly, read the execution plan of your queries. Sometimes the plan scans all chunks instead of only the expected ones.
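As an illustration, the principles above translate into statements like the ones below (table and column names are taken from your question; the intervals, bucket size, and sensor ids are placeholders to tune, and compression and continuous aggregates require a TimescaleDB version/edition that supports them):

-- Chunk interval: set at hypertable creation time, or change it for future chunks
SELECT set_chunk_time_interval('public.loadprofilepool', INTERVAL '1 week');

-- Compression: segment by sensor_id so per-sensor scans stay cheap,
-- then compress chunks older than a chosen age
ALTER TABLE public.loadprofilepool SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id',
    timescaledb.compress_orderby   = 'time'
);
SELECT add_compression_policy('public.loadprofilepool', INTERVAL '30 days');

-- Continuous aggregate: pre-compute hourly averages per sensor so queries
-- do not have to re-aggregate the raw 15-minute data every time
CREATE MATERIALIZED VIEW loadprofile_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       sensor_id,
       AVG(value) AS avg_value
FROM public.loadprofilepool
GROUP BY time_bucket('1 hour', time), sensor_id;

-- Execution plan: check that only the expected chunks are scanned
EXPLAIN (ANALYZE, BUFFERS)
SELECT time, AVG(value)
FROM public.loadprofilepool
WHERE sensor_id = ANY (ARRAY[101, 102, 103])
  AND time >= '2023-01-01' AND time < '2023-02-01'
GROUP BY time;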
For more details, read the blog here:
https://www.timescale.com/blog/timescale-parameters-you-should-know-about-and-tune-to-maximize-your-performance/