I need to write an efficient query for my backend application that returns the number of online sessions (count) per day for a given date range.
Sample input:
Year: 2020, month: 08, timezone: "Asia/Dhaka" <- the timezone can be any timezone passed from the backend.
Sample output:
[
{"date": "2020-08-01", "online_session_count": 3},
{"date": "2020-08-02", "online_session_count": 0},
{"date": "2020-08-03", "online_session_count": 4},
...... other rows ......
{"date": "2020-08-31", "online_session_count": 1},
]
Here is the table definition:
CREATE TABLE online_speakers (
id INT AUTO_INCREMENT PRIMARY KEY,
speaker_name VARCHAR(255) NOT NULL,
start_time DATETIME NOT NULL,
end_time DATETIME NOT NULL
);
Here are some dummy rows:
INSERT INTO online_speakers (speaker_name, start_time, end_time) VALUES
('Speaker 1', '2020-08-01 10:00:00', '2020-08-01 11:00:00'),
('Speaker 2', '2020-08-01 14:00:00', '2020-08-01 15:00:00'),
('Speaker 3', '2020-08-02 09:00:00', '2020-08-02 10:00:00'),
('Speaker 4', '2020-08-03 10:00:00', '2020-08-03 11:00:00'),
('Speaker 5', '2020-08-03 13:00:00', '2020-08-03 14:00:00'),
('Speaker 6', '2020-08-03 15:00:00', '2020-08-03 16:00:00'),
('Speaker 7', '2020-08-04 11:00:00', '2020-08-04 12:00:00'),
('Speaker 8', '2020-08-04 13:00:00', '2020-08-04 14:00:00'),
('Speaker 9', '2020-08-05 10:00:00', '2020-08-05 11:00:00'),
('Speaker 10', '2020-08-05 15:00:00', '2020-08-05 16:00:00');
And here is the query that I wrote:
WITH RECURSIVE date_range AS (
SELECT DATE('2020-08-01') AS date
UNION ALL
SELECT DATE_ADD(date, INTERVAL 1 DAY)
FROM date_range
WHERE DATE_ADD(date, INTERVAL 1 DAY) <= '2020-08-31'
)
SELECT
date_range.date,
COALESCE(session_tbl.online_session_count, 0) AS online_session_count
FROM
date_range
LEFT JOIN (
SELECT
DATE(CONVERT_TZ(start_time, 'UTC', 'Asia/Dhaka')) AS date,
COUNT(*) AS online_session_count
FROM
online_speakers
WHERE
start_time BETWEEN CONVERT_TZ('2020-08-01 00:00:00', 'Asia/Dhaka', 'UTC')
AND CONVERT_TZ('2020-08-31 23:59:59', 'Asia/Dhaka', 'UTC')
GROUP BY
date
ORDER BY
date
) session_tbl ON session_tbl.date = date_range.date;
As you can see, the query applies functions to start_time in order to filter and group, which is very inefficient for large tables containing millions of rows. In such scenarios, how do experts tackle the problem?
Any suggestions/improvement to my existing query is also welcome.
2 Answers
Use a separate table to store all the timezone offset values. The UTC timezone's value should be '+00:00', and every other timezone's value should be its offset relative to UTC. Now change your query accordingly; the full query and its output are in the fiddle below:
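The answer's offset table and revised query weren't captured in this copy; here is a minimal sketch of the idea. The table and column names (timezones, tz_name, utc_offset) are illustrative, not from the original:

```sql
-- Hypothetical offset lookup table:
CREATE TABLE timezones (
    tz_name    VARCHAR(64) PRIMARY KEY,
    utc_offset VARCHAR(6) NOT NULL   -- e.g. '+00:00', '+06:00', '-08:00'
);

INSERT INTO timezones (tz_name, utc_offset) VALUES
('UTC',                 '+00:00'),
('Asia/Dhaka',          '+06:00'),
('America/Los_Angeles', '-08:00');

-- CONVERT_TZ also accepts raw offsets, so this variant does not depend on
-- the mysql time-zone tables being loaded:
SELECT DATE(CONVERT_TZ(s.start_time, '+00:00', t.utc_offset)) AS date,
       COUNT(*) AS online_session_count
FROM online_speakers s
JOIN timezones t ON t.tz_name = 'Asia/Dhaka'
GROUP BY DATE(CONVERT_TZ(s.start_time, '+00:00', t.utc_offset));
```

The trade-off of storing fixed offsets instead of named zones is that DST transitions are not handled automatically.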
db<>fiddle code: https://dbfiddle.uk/AZBavvxv
As a preface, I'll say that this kind of functionality (i.e. grouping by date in a runtime-defined, per-request timezone) is actually kinda rare; I don't recall ever seeing this feature in any software I've used, actually.
The problem here is that for high performance you'll need to precompute date-only values and create an INDEX over them – but if you need to apply a per-request calculation that results in different date values, then the INDEX will be useless and the RDBMS will need to do a full table-scan (or at least a range-scan, which isn't pretty).

Plenty of software, reporting functionality especially, does offer the ability to group by date, but it will do one of three variations on the same theme: precompute the calendar date in some fixed timezone and use that for the aggregation's GROUP BY. The downside common to all 3 approaches is that users can't specify a per-request timezone with which to calculate the dates for the aggregation's GROUP BY.

In your case, consider if the added complexity of allowing users to set a per-request timezone for calendar-date calculations is actually worth the effort. Consider alternatives, such as using UTC-only dates, and look at your actual user-base to see how many users would be adversely affected (e.g. if your users are within ~6 hours of UTC and most of these online_speakers rows have timestamps between 06:00 and 18:00, then you'll get away just fine with grouping by UTC dates).

Whereas if you really need this functionality, and you need it indexed, then here's how I'd do it:
Side-note: If you were using an RDBMS that supports Materialized Views or Indexed Views (which MySQL 8.x does not) and if the only query predicates are a date-range and a timezone-id, then I'd create separate VIEW objects for each supported timezone offset (I reckon you'd only need about ~30: 24 for the hour-aligned offsets, and about 6 or so for the non-hour-aligned ones). Note that you wouldn't necessarily convert values to a specific timezone, instead just apply the offset; I advocate this (despite it not correctly handling DST and timezone changes) because having a VIEW for each named timezone-id, rather than each offset, would be cost-prohibitive for little gain – and any DST issue can be worked around by having your application code simply look up the adjustment dates and re-issue multiple queries, one for each period between adjustments. Something like this:
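The VIEW definition didn't survive in this copy; a sketch of what a per-offset view might look like, for the +0100 offset (the local_date column name is illustrative):

```sql
-- 'p0100' = offset +01:00 ('p' for '+'); create one view per supported offset.
CREATE VIEW online_speakers_p0100 AS
SELECT
    id,
    speaker_name,
    start_time,
    end_time,
    DATE(DATE_ADD(start_time, INTERVAL 1 HOUR)) AS local_date
FROM online_speakers;
```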
…then repeat for all the other offsets: +0200, +0300, +0530, -0800, etc. (the chars P and N are used to represent the zone's UTC offset + and - sign, btw). You'll need to change INTERVAL 1 HOUR too, of course.

So your application-code would do something like this:
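The application-code snippet was lost in extraction; here is a hedged Python sketch of the idea. getTimeZoneAdjustmentsDuringRange (which the answer leaves as an exercise), the view-name scheme, and the DB-API conn object are all assumptions, not from the original:

```python
from datetime import date

def view_name_for_offset(offset_minutes: int) -> str:
    # Map a UTC offset (in minutes) to the matching per-offset view name,
    # using 'p'/'n' for the '+'/'-' sign as described above.
    sign = 'p' if offset_minutes >= 0 else 'n'
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"online_speakers_{sign}{hours:02d}{minutes:02d}"

def sessions_per_day(conn, range_start: date, range_end: date, tz: str) -> dict:
    """Sum per-date session counts across every fixed-offset period in the range."""
    counts: dict = {}
    # getTimeZoneAdjustmentsDuringRange would yield (sub_start, sub_end,
    # offset_minutes) tuples, one per period between timezone adjustments
    # (DST changes etc.) -- implementing it is the reader's exercise.
    for sub_start, sub_end, offset_minutes in getTimeZoneAdjustmentsDuringRange(tz, range_start, range_end):
        view = view_name_for_offset(offset_minutes)
        cursor = conn.cursor()
        cursor.execute(
            f"SELECT local_date, COUNT(*) FROM {view} "
            "WHERE local_date BETWEEN %s AND %s "
            "GROUP BY local_date",
            (sub_start, sub_end),
        )
        for local_date, n in cursor.fetchall():
            counts[local_date] = counts.get(local_date, 0) + n
    return counts
```

Summing across periods is what makes the DST workaround correct: a month that straddles an adjustment is queried once per offset, and the per-date counts are merged.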
Implementing getTimeZoneAdjustmentsDuringRange is an exercise for the reader.

And now back to your problem…
The solution I propose, below, is imperfect: it relies on precomputing an indexable value derived from start_time, but instead of trying to GROUP BY a date, it groups events by hour (or by half-hour increments, if you want to support the 30-minute-offset timezones). This means your query will need to transfer at most 24 rows per calendar-day (or 48 rows for 30-min timezones); your application code can then perform the final aggregation step to group by date (or, if you want to do it in SQL, you can, but profile and benchmark your results; you might be surprised at which approach is faster). MySQL supports indexing computed columns just fine, btw.
First, extend your online_speakers table with a new computed column, start_time_hour, that zeroes out the minutes and seconds; i.e. it rounds start_time down to the start of its current hour (it does not round up to the nearest hour):
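The ALTER TABLE statement is missing from this copy of the answer; a sketch of how the computed column and its index might be declared in MySQL 8 (the column and index names follow the ones referenced later in the answer):

```sql
ALTER TABLE online_speakers
    ADD COLUMN start_time_hour DATETIME
        GENERATED ALWAYS AS (
            -- Round start_time *down* to the start of its hour:
            DATE_ADD(DATE(start_time), INTERVAL HOUR(start_time) HOUR)
        ) STORED,
    ADD INDEX ix_start_time_hour (start_time_hour);
```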
:Then have a query like this:
(Explanation below)
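The query itself was lost in this copy of the answer; below is a hedged reconstruction based on the explanation that follows, hard-coding Asia/Dhaka (UTC+06:00) and August 2020 as the example inputs:

```sql
WITH rangedEventsCountByHour AS (
    -- A range-scan of ix_start_time_hour; COUNT(*) is answered from the
    -- index alone. Returns at most 1 row per distinct hour.
    SELECT start_time_hour, COUNT(*) AS cnt
    FROM online_speakers
    WHERE start_time_hour >= '2020-07-31 18:00:00'  -- 2020-08-01 00:00 +06:00, in UTC
      AND start_time_hour <  '2020-08-31 18:00:00'  -- 2020-09-01 00:00 +06:00, in UTC
    GROUP BY start_time_hour
),
rangedEventsCountByOffsetHours AS (
    -- Shift each UTC hour into the user's timezone (+06:00 here).
    SELECT DATE_ADD(start_time_hour, INTERVAL 6 HOUR) AS offset_hour, cnt
    FROM rangedEventsCountByHour
),
groupedByOffsetDate AS (
    -- Collapse user-local hours into user-local calendar dates.
    SELECT DATE(offset_hour) AS offset_date, SUM(cnt) AS online_session_count
    FROM rangedEventsCountByOffsetHours
    GROUP BY DATE(offset_hour)
)
SELECT offset_date, online_session_count
FROM groupedByOffsetDate;
```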
This query assumes the INDEX ix_start_time_hour exists. If you're concerned that it will be slow because it's so long or uses a chain of 4 CTEs, remember that SQL queries do not describe a sequence of instructions or any kind of steps to follow; the "depth" of a query is in no way an indication of its performance.
I appreciate that this does look unnecessarily verbose, and I know MySQL lets you use a SELECT's column-expressions in its GROUP BY clause, but I'm not a MySQL expert and I don't know the details, so my query above tries to be ISO SQL-compliant for the benefit of other readers. I'll also admit that this is untested: I don't have access to a MySQL instance with enough suitable data to test this query with right now.
- Because online_speakers.start_time_hour is indexed, when rangedEventsCountByHour runs, the RDBMS needs only a range-scan of that index and can perform the COUNT(*) using the INDEX alone (so it doesn't even need to read the source TABLE online_speakers). This step returns at most 1 row per distinct hour, regardless of the number of online_speakers rows per date+hour.
- The rangedEventsCountByOffsetHours CTE performs the conversion of the UTC-based start_time_hour values to the user's own timezone. Keeping this as a separate step avoids repeating the same expressions between the SELECT exprs+agg vs. GROUP BY exprs clauses. Plus it makes the whole thing easier to understand, imo.
- The groupedByOffsetDate CTE is what converts it from user-hours to user-calendar-dates by performing another GROUP BY aggregation, but using SUM(cnt), not COUNT(*), for obvious reasons.
- Because the previous step (rangedEventsCountByOffsetHours) and this step (groupedByOffsetDate) are dependent on variable input, these 2 steps can't take further advantage of any INDEX or precomputed columns – so this part is inherently expensive. But remember that the first step (rangedEventsCountByHour) will return at most 24 rows per calendar day (and 0 rows for hours that had no events), so a query covering a whole month means at most 744 rows, and 8,760 for a year – a trivial amount for these CTEs to process.
- The final SELECT is just there because SQL doesn't let you give a name to the outermost SELECT query, for reasons I'll never know.
- I didn't add an ORDER BY here because in many (if not most?) cases it's faster to sort the results in application-code; an ORDER BY also slows the system down because the RDBMS has to buffer up all the results first, then sort them, before it can return anything to the client – whereas not using ORDER BY means the RDBMS can start returning rows to the client long before it's finished reading/generating the full result-set.