Situation:
Imagine some script that, upon completion, inserts the current timestamp and some other data into a MySQL table. It’s executed every thirty minutes, sometimes not at all, leaving gaps in the data.
Goal:
Have a query that fetches all data with the timestamp rounded to the closest half hour and empty rows (all fields except for the timestamp should be null) when there’s no data.
Restrictions:
Neither the table structure, the data itself nor the script can be altered.
Problem:
The only solution I could come up with that yields the desired result doesn’t scale. Currently the actual table counts about 50’000 rows and it already takes more than fifteen minutes to complete the query.
Example:
CREATE TABLE IF NOT EXISTS `statuses` (
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`status` INT NOT NULL DEFAULT '0',
PRIMARY KEY (`timestamp`)
);
INSERT
IGNORE INTO `statuses` (`timestamp`, `status`)
VALUES
('2023-01-01 00:03:34', '164850'),
('2023-01-01 00:31:23', '794088'),
('2023-01-01 03:31:28', '686754'),
('2023-01-01 04:01:15', '684711'),
('2023-01-01 05:31:35', '116777'),
('2023-01-01 06:01:52', '469332'),
('2023-01-01 06:31:55', '816300'),
('2023-01-01 08:33:53', '309583'),
('2023-01-01 09:03:54', '847976'),
('2023-01-01 09:31:33', '812517');
WITH RECURSIVE `timestamps`(`timestamp`) AS (
SELECT
(
SELECT
FROM_UNIXTIME(
UNIX_TIMESTAMP(MIN(`timestamp`)) - MOD(UNIX_TIMESTAMP(MIN(`timestamp`)), 1800)
)
FROM
`statuses`
)
UNION
ALL
SELECT
DATE_ADD(`timestamp`, INTERVAL 30 MINUTE)
FROM
`timestamps`
WHERE
`timestamp` < (
SELECT
FROM_UNIXTIME(
UNIX_TIMESTAMP(MAX(`timestamp`)) - MOD(UNIX_TIMESTAMP(MAX(`timestamp`)), 1800)
)
FROM
`statuses`
)
)
SELECT
`t`.`timestamp`,
`s`.`status`
FROM
`timestamps` AS `t`
LEFT OUTER JOIN `statuses` AS `s` ON `t`.`timestamp` = FROM_UNIXTIME(
UNIX_TIMESTAMP(`s`.`timestamp`) - MOD(UNIX_TIMESTAMP(`s`.`timestamp`), 1800)
)
ORDER BY
`t`.`timestamp` ASC;
2
Answers
Your query looks fine. I’d convert the original timestamps to half hour timestamps first, though, and work on this base. And in the recursive query I select the maximum timestamp only once. I don’t know whether thhis will make a big difference.
One more point: I am aggregating the data first in order to avoid duplicates (e.g. 05:16 and 05:44 both rounding to 05:30). If it is guaranteed that two timestamps are at least 30 minutes apart, this will not be necessary.
Demo: https://dbfiddle.uk/78s6-jzC
We can simplify the recursive part a little. As Thorsten Kettner mentions, there is no need to reselect the max status date on each iteration, we can do that in the anchor – and I would also argue that we don’t even need to round the max date (although that’s micro-optimization).
When it comes to the outer query, I would not recommend applying functions on the status timestamp; this is the primary key of the table, against which we do want to run a SARGEable predicate: let’s use a half-open interval instead (and we don’t need to perform the unixtime conversion again here):
Note that I renamed column
timestamp
tots
, to avoid clashing with the corresponding SQL keyword.If that’s not good enough, then one alternative would be to materialize the results of the recursive query in a calendar table.
You would typically cover a large period of time with the table. You can use the recursive query to create it:
We can declare a primary key to benefit the underlying index:
Then we can use the table in our query. Ideally, you know in advance which date range you are looking for. But if you don’t, we can bring the min/max status dates and use them to pre-filter the calendar table.
Here is a demo on DB Fiddle