Mysql - Round timestamps to the nearest half hour without omitting missing data

oliverm
May 4, 2023
302 views
0 votes
2 Answers

Situation:

Imagine some script that, upon completion, inserts the current timestamp and some other data into a MySQL table. It’s executed every thirty minutes, sometimes not at all, leaving gaps in the data.

Goal:

Have a query that fetches all data with the timestamp rounded to the closest half hour and empty rows (all fields except for the timestamp should be null) when there’s no data.

Restrictions:

Neither the table structure, the data itself nor the script can be altered.

Problem:

The only solution I could come up with that yields the desired result doesn’t scale. Currently the actual table counts about 50’000 rows and it already takes more than fifteen minutes to complete the query.

Example:

CREATE TABLE IF NOT EXISTS `statuses` (
    `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    `status` INT NOT NULL DEFAULT '0',
    PRIMARY KEY (`timestamp`)
);

INSERT
    IGNORE INTO `statuses` (`timestamp`, `status`)
VALUES
    ('2023-01-01 00:03:34', '164850'),
    ('2023-01-01 00:31:23', '794088'),
    ('2023-01-01 03:31:28', '686754'),
    ('2023-01-01 04:01:15', '684711'),
    ('2023-01-01 05:31:35', '116777'),
    ('2023-01-01 06:01:52', '469332'),
    ('2023-01-01 06:31:55', '816300'),
    ('2023-01-01 08:33:53', '309583'),
    ('2023-01-01 09:03:54', '847976'),
    ('2023-01-01 09:31:33', '812517');

WITH RECURSIVE `timestamps`(`timestamp`) AS (
    SELECT
        (
            SELECT
                FROM_UNIXTIME(
                    UNIX_TIMESTAMP(MIN(`timestamp`)) - MOD(UNIX_TIMESTAMP(MIN(`timestamp`)), 1800)
                )
            FROM
                `statuses`
        )
    UNION
    ALL
    SELECT
        DATE_ADD(`timestamp`, INTERVAL 30 MINUTE)
    FROM
        `timestamps`
    WHERE
        `timestamp` < (
            SELECT
                FROM_UNIXTIME(
                    UNIX_TIMESTAMP(MAX(`timestamp`)) - MOD(UNIX_TIMESTAMP(MAX(`timestamp`)), 1800)
                )
            FROM
                `statuses`
        )
)
SELECT
    `t`.`timestamp`,
    `s`.`status`
FROM
    `timestamps` AS `t`
    LEFT OUTER JOIN `statuses` AS `s` ON `t`.`timestamp` = FROM_UNIXTIME(
        UNIX_TIMESTAMP(`s`.`timestamp`) - MOD(UNIX_TIMESTAMP(`s`.`timestamp`), 1800)
    )
ORDER BY
    `t`.`timestamp` ASC;

Answers

- ThorstenKettner
- May 4, 2023 at 10:57 pm
- 0 votes
0
Your query looks fine. I’d convert the original timestamps to half hour timestamps first, though, and work on this base. And in the recursive query I select the maximum timestamp only once. I don’t know whether thhis will make a big difference.

One more point: I am aggregating the data first in order to avoid duplicates (e.g. 05:16 and 05:44 both rounding to 05:30). If it is guaranteed that two timestamps are at least 30 minutes apart, this will not be necessary.
```
WITH RECURSIVE data AS
(
  SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`timestamp`) / 1800) * 1800) as ts,
    MAX(status) AS status
  FROM statuses
  GROUP BY FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`timestamp`) / 1800) * 1800)
)
, timestamps (ts, tsmax) AS
(
  SELECT MIN(ts), MAX(ts) FROM data
  UNION ALL
  SELECT ts + INTERVAL 30 MINUTE, tsmax FROM timestamps WHERE ts < tsmax
)
SELECT t.ts, d.status
FROM timestamps t
LEFT OUTER JOIN data d ON d.ts = t.ts
ORDER BY t.ts;
```
Demo: https://dbfiddle.uk/78s6-jzC
Login or Signup to reply.

- GMB
- May 4, 2023 at 11:05 pm
- 0 votes
0
We can simplify the recursive part a little. As Thorsten Kettner mentions, there is no need to reselect the max status date on each iteration, we can do that in the anchor – and I would also argue that we don’t even need to round the max date (although that’s micro-optimization).

When it comes to the outer query, I would not recommend applying functions on the status timestamp; this is the primary key of the table, against which we do want to run a SARGEable predicate: let’s use a half-open interval instead (and we don’t need to perform the unixtime conversion again here):
```
with recursive timestamps (ts, max_ts) as (
    select from_unixtime(floor(unix_timestamp(min(ts)) / 1800) * 1800) ts, max(ts) max_ts
    from statuses
    union all
    select ts + interval 30 minute, max_ts from timestamps where ts + interval 30 minute <= max_ts
)
select t.ts, s.status
from timestamps t
left join statuses s on s.ts >= t.ts and s.ts < t.ts + interval 30 minute
order by t.ts
```
Note that I renamed column timestamp to ts, to avoid clashing with the corresponding SQL keyword.

If that’s not good enough, then one alternative would be to materialize the results of the recursive query in a calendar table.

You would typically cover a large period of time with the table. You can use the recursive query to create it:
```
create table timestamp_calendar as 
with recursive timestamps as (
    select '2022-01-01 00:00:00' ts, '2022-01-02 12:00:00' max_ts -- short period for testing
    union all
    select ts + interval 30 minute, max_ts 
    from timestamps
    where ts < max_ts
)
select * from timestamps;
```
We can declare a primary key to benefit the underlying index:
```
alter table timestamp_calendar add primary key (ts);
```
Then we can use the table in our query. Ideally, you know in advance which date range you are looking for. But if you don’t, we can bring the min/max status dates and use them to pre-filter the calendar table.
```
select t.ts, s.status
from timestamp_calendar t
inner join (select min(ts) min_ts, max(ts) max_ts from statuses) x 
    on  t.ts >  x.min_ts - interval 30 minute 
    and t.ts <= x.max_ts
left join statuses s 
    on  s.ts >= t.ts 
    and s.ts <  t.ts + interval 30 minute
order by t.ts
```
Here is a demo on DB Fiddle
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Mysql – Round timestamps to the nearest half hour without omitting missing data

Answers