I have the following time-series data table:
Id | Object_Id | Datetime | Sensor1 | Sensor2 |
---|---|---|---|---|
1 | 175 | '2022-03-24 22:01:00' | 0 | 0 |
2 | 175 | '2022-03-24 22:02:00' | 0 | 0 |
3 | 175 | '2022-03-24 22:03:00' | 5.5699381666667 | 81.342836833333 |
4 | 175 | '2022-03-24 22:04:00' | 5.5668366666667 | 81.281143 |
… | … | … | … | … |
48 | 175 | '2022-03-24 22:48:00' | NULL | NULL |
… | … | … | … | … |
In the above table I have columns sensor1 through sensor1000, and the table holds millions of time-series records.
Scenario 1:
I want to retrieve the start-time (min datetime) and end-time (max datetime) records for multiple sensors, e.g. sensor1 and sensor2, where their value IS NOT NULL.
I constructed the query below for sensor1:
(SELECT datetime, id, sensor1
   FROM timeseries
   WHERE sensor1 IS NOT NULL
   ORDER BY datetime ASC
   LIMIT 1)
UNION ALL
(SELECT datetime, id, sensor1
   FROM timeseries
   WHERE sensor1 IS NOT NULL
   ORDER BY datetime DESC
   LIMIT 1)
and the corresponding query for sensor2:
(SELECT datetime, id, sensor2
   FROM timeseries
   WHERE sensor2 IS NOT NULL
   ORDER BY datetime ASC
   LIMIT 1)
UNION ALL
(SELECT datetime, id, sensor2
   FROM timeseries
   WHERE sensor2 IS NOT NULL
   ORDER BY datetime DESC
   LIMIT 1)
This query works as expected. Please note that I wrote an individual query for each of sensor1 and sensor2 because I want to handle failures per sensor.
The problem is that in the real-life scenario I have millions of records, around 50 to 70 GB of data.
When I run the above query (for just sensor1 and sensor2), it takes a significant amount of time to return a result, and it only gets worse when a user requests multiple sensors; we can observe this in the real implementation.
In this demo it runs smoothly because there are only 150 records.
I need help improving/optimizing the performance of this query.
FYI: for each requested sensor, a query is prepared in the Node/Express app; the app then connects to the MySQL server and executes the queries in parallel:
const queryResults = await Promise.all(
  queries.map(
    (query) =>
      new Promise((resolve) =>
        db.query(query, [], (err, result) => {
          if (err) {
            return resolve([]); // if an individual query fails, resolve with an empty array
          }
          return resolve(result);
        })
      )
  )
);
2 Answers
This would run twice as fast (but have different output):
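A sketch of what that faster query could look like (assuming a single MIN/MAX aggregate per sensor was meant; it returns one row with two columns rather than two full rows, hence the different output):

SELECT MIN(datetime) AS starttime,
       MAX(datetime) AS endtime
  FROM timeseries
  WHERE sensor1 IS NOT NULL;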
I suspect that your "sensor1" and "sensor2" could be any of sensor1…sensor1000? If that is the case, there is no way to use "sensorN" in an index.
So the best you can do with indexing (either for the UNION or my suggestion) is INDEX(datetime). Slightly better would be to have datetime as the first column of the PRIMARY KEY, but I do not like that; I would prefer dropping `id`, making datetime the PRIMARY KEY, and ensuring that there are no duplicate datetime values. Note that DATETIME will have 60 minutes of duplicates once a year (when daylight-saving time ends and wall-clock times repeat); consider using TIMESTAMP instead.
Only very rarely are the 16 significant digits of DOUBLE needed; I suggest changing to FLOAT to cut the table size almost in half (DOUBLE: 8 bytes; FLOAT: 4 bytes, about 7 significant digits).
If there are lots of NULLs and/or the number of sensors keeps growing, I strongly recommend switching to fewer columns and more rows: don't store NULLs, and do make the first 3 columns the PRIMARY KEY.
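A minimal sketch of that narrower layout (the table and column names here are illustrative, not from the original answer; it also folds in the TIMESTAMP and FLOAT suggestions above):

-- One row per (object, sensor, reading time); absent readings are simply not stored.
CREATE TABLE sensor_reading (
  object_id INT UNSIGNED NOT NULL,
  sensor_id SMALLINT UNSIGNED NOT NULL,  -- 1..1000 instead of 1000 columns
  datetime  TIMESTAMP NOT NULL,          -- TIMESTAMP avoids the yearly DST duplicates
  value     FLOAT NOT NULL,              -- FLOAT halves the storage vs DOUBLE
  PRIMARY KEY (object_id, sensor_id, datetime)
);

With this layout, the start/end lookup for one sensor becomes a pair of cheap index dives:

SELECT MIN(datetime) AS starttime,
       MAX(datetime) AS endtime
  FROM sensor_reading
  WHERE object_id = 175 AND sensor_id = 1;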
In addition to what Rick suggests, you might try adding per-sensor indexes to accelerate your original queries. They may help; they definitely produce index-based execution plans on your test data set rather than table scans: https://dbfiddle.uk/XqE-lIvK
There are more elaborate solutions involving indexes on computed columns and that sort of thing, but try this first.
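The index definitions themselves are in the linked fiddle rather than reproduced here; one plausible shape, assuming a composite index per sensor column (index names hypothetical, only two of the thousand sensors shown):

-- Hypothetical DDL; the exact definitions in the fiddle may differ.
ALTER TABLE timeseries
  ADD INDEX idx_sensor1 (sensor1, datetime),
  ADD INDEX idx_sensor2 (sensor2, datetime);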