skip to Main Content

I need to write an efficient query from my backend application so that I may know how many online sessions are there (count) per day for a given date range.

Sample input:

Year: 2020, month: 08, timezone: "Asia/Dhaka". <- timezone can be any timezone passed from backend.

Sample output:

[
{"date": "2020-08-01", "online_session_count": 3},
{"date": "2020-08-02", "online_session_count": 0},
{"date": "2020-08-03", "online_session_count": 4},
...... other rows ......
{"date": "2020-08-31", "online_session_count": 1},
]

Here is the table definition:

CREATE TABLE online_speakers (
    id INT AUTO_INCREMENT PRIMARY KEY,
    speaker_name VARCHAR(255) NOT NULL,
    start_time DATETIME NOT NULL,
    end_time DATETIME NOT NULL
);

Here are some dummy rows:

INSERT INTO online_speakers (speaker_name, start_time, end_time) VALUES
('Speaker 1', '2020-08-01 10:00:00', '2020-08-01 11:00:00'),
('Speaker 2', '2020-08-01 14:00:00', '2020-08-01 15:00:00'),
('Speaker 3', '2020-08-02 09:00:00', '2020-08-02 10:00:00'),
('Speaker 4', '2020-08-03 10:00:00', '2020-08-03 11:00:00'),
('Speaker 5', '2020-08-03 13:00:00', '2020-08-03 14:00:00'),
('Speaker 6', '2020-08-03 15:00:00', '2020-08-03 16:00:00'),
('Speaker 7', '2020-08-04 11:00:00', '2020-08-04 12:00:00'),
('Speaker 8', '2020-08-04 13:00:00', '2020-08-04 14:00:00'),
('Speaker 9', '2020-08-05 10:00:00', '2020-08-05 11:00:00'),
('Speaker 10', '2020-08-05 15:00:00', '2020-08-05 16:00:00');

And here is the query that I wrote:

WITH RECURSIVE date_range AS (
    SELECT DATE('2020-08-01') AS date
    UNION ALL
    SELECT DATE_ADD(date, INTERVAL 1 DAY)
    FROM date_range
    WHERE DATE_ADD(date, INTERVAL 1 DAY) <= '2020-08-31'
)
SELECT
    date_range.date,
    COALESCE(session_tbl.online_session_count, 0) AS online_session_count
FROM
    date_range
LEFT JOIN (
    SELECT
        DATE(CONVERT_TZ(start_time, 'UTC', 'Asia/Dhaka')) AS date,
        COUNT(*) AS online_session_count
    FROM
        online_speakers
    WHERE
        start_time BETWEEN CONVERT_TZ('2020-08-01 00:00:00', 'Asia/Dhaka', 'UTC')
                      AND CONVERT_TZ('2020-08-31 23:59:59', 'Asia/Dhaka', 'UTC')
    GROUP BY
        date
    ORDER BY
        date
) session_tbl ON session_tbl.date = date_range.date;

As you can see, the where clause is using aggregating functions, which is very inefficient for large tables containing millions of rows. In such scenarios, how do experts tackle the problem?

Any suggestions/improvement to my existing query is also welcome.

2

Answers


  1. Use a different table to save all timezone value.

    UTC timezone value should be ‘+00:00’ and other timezone should be according to that. Like:

    'UTC'          : '+00:00'
    'Asia/Dhaka'   : '+06:00' 
    'Asia/Kolkata' : '+05:30'
    

    Now change your query accordingly…

    So, you code should be:

    
    WITH RECURSIVE date_range AS (
        SELECT DATE('2020-08-01') AS date
        UNION ALL
        SELECT DATE_ADD(date, INTERVAL 1 DAY)
        FROM date_range
        WHERE DATE_ADD(date, INTERVAL 1 DAY) <= '2020-08-31'
    )
    SELECT
        date_range.date,
        COALESCE(session_tbl.online_session_count, 0) AS online_session_count
    FROM
        date_range
    LEFT JOIN (
        SELECT
            DATE(CONVERT_TZ(start_time, '+00:00', '+06:00')) AS date,
            COUNT(*) AS online_session_count
        FROM
            online_speakers
        WHERE
            start_time BETWEEN CONVERT_TZ('2020-08-01 00:00:00', '+00:00', '+06:00')
                          AND CONVERT_TZ('2020-08-31 23:59:59', '+00:00', '+06:00')
        GROUP BY
            date
        ORDER BY
            date
    ) session_tbl ON session_tbl.date = date_range.date;
    

    Output:
    enter image description here

    db<>fiddle code: https://dbfiddle.uk/AZBavvxv

    Login or Signup to reply.
  2. As a preface, I’ll say that this kind of functionality (i.e. grouping by date in a runtime-defined per-request timezone) is actually kinda rare, I don’t recall ever seeing feature in any software I’ve used, actually.

    The problem here is that for high-perf you’ll need to precompute date-only values and create an INDEX over them – but if you need to apply a per-request calculation to a value which results in different date values then the INDEX will useless and the RDBMS will need to do a full table-scan (or at least a range-scan, which isn’t pretty.

    Plenty of software, reporting functionality especially, does offer the ability to group by date, but they’ll do one of three things:

    1. Use a single (non-UTC) local timezone for everything. This was (and probably still is) how a lot of on-prem software still works, even though it’s a bad idea anyway (because local timezone offsets are not immutable), which means it’s impossible correctly to map those timestamps back to UTC, and more problems besides.
    2. Use UTC exclusively: users can only see data grouped by UTC date, not their own (local timezone-defined) dates.
    3. In a system where data can be logically partitioned by something that’s also associated with a timezone, such as a web-discussion-forum showing a user their own posts-by-day grouped by their own TZ’s dates, or (especially) in a multi-tenant system where each tenant has an associated timezone, then the system can determine the date for that tenant and save that with the original UTC timestamp – this column will then indexed and will be very fast when performing a GROUP BY.

    The downside common to all 3 approaches is users can’t specify a per-request timezone with which to calculate dates for the aggregation’s GROUP BY.


    In your case, consider if the added complexity of allowing user’s to set a per-request timezone for calendar-date calculations is actually worth the effort. Consider alternatives, such as using UTC-only dates and look at your actual user-base to see how many users would be adversely affected (e.g. if your users are within ~6 hours of UTC and most of these online_speakers rows have timestamps between 06:00 and 18:00 then you’ll get-away-just-fine with grouping by UTC dates.

    Whereas if you really need this functionality, and you need it indexed, then here’s how I’d do it:


    Side-note: If you were using an RDBMS that supported Materialized Views or Indexed Views (which MySQL 8.x does not support) and if the only query predicates are a date-range and timezone-id, then I’d create separate VIEW objects for each supported timezone-id (I reckon you’d only need about ~30 timezones: 24 for the hour-aligned timezone offsets, and about 6 or so for the non-hour-aligned timezone offsets). Note that you wouldn’t necessarily convert values to a specific timezone, instead just apply the offset; I advocate this (despite it not correctly handling DST and timezone changes) because having a VIEW for each timezone-id would be cost-prohibitive for little gain – and any issue can be worked-around by having your application code simply lookup adjustment dates and re-issuing multiple queries for each period between adjustments.

    Something like this:

    CREATE VIEW online_speakers_daily_counts_P0100 AS
    SELECT
        DATE( DATE_ADD( start_time, INTERVAL 1 HOUR ) ) AS start_time_date
        COUNT(*) AS count
    FROM
        online_speakers
    GROUP BY
        DATE( DATE_ADD( start_time, INTERVAL 1 HOUR ) );
    
    CREATE UNIQUE CLUSTERED INDEX ON online_speakers_daily_counts_P0100 ( start_time_date ) INCLUDE ( count );
    

    …then repeat for all the other offsets: +0200, +0300, +0530, -0800, etc. (chars P and N are used to represent the zone’s UTC offset + and - sign btw). You’ll need to change INTERVAL 1 HOUR too, of course.

    So your application-code would do something like this:

    let dtRangeStart  : Date = ...
    let dtRangeEndExcl: Date = ... // I assuming this is an exclusive-upper-bound and pre-validated such that `dtRangeEnd > dtRangeStart`.
    let timezoneId    : IanaTimeZoneId = ...
    
    let timeZoneEventsInRange: List<(DateTime utc, TimeSpan utcOffset)> = getTimeZoneAdjustmentsDuringRange( dtRangeStart, dtRangeEndExcl, timeZoneId );
    
    let allResults = new List<(Date d, Int count)>();
    
    foreach( (DateTime utc, TimeSpan utcOffset) in timeZoneEventsInRange ) {
        let sign: Char = ( utcOffset.totalTicks == 0 ) ? 'Z' : ( utcOffset.totalTicks > 0 ) ? 'P' : 'N';
    
        let sqlViewName = "online_speakers_daily_counts_{sign}{utcOffset.hh}{utcOffset.mm}";
    
        if( !doesSqlViewExist( sqlViewName ) ) {
            return Error( "Sorry, $timezoneId is not a supported timezone" 
        );
    
        let results = execSqlQuery( "SELECT start_time_date, count FROM {sqlViewName} WHERE start_time_date >= {dtRangeStart} AND start_time_date < {dtRangeEndExcl}" );
        allResults.append( results );
    }
    
    allResults.sort( (x,y) => x.d - y.d );
    
    printResults( allResults );
    

    Implementing getTimeZoneAdjustmentsDuringRange is an exercise for the reader.

    And now back to your problem…


    The solution I propose, below, is imperfect: it relies on precomputing an indexable value derived from start_time: but instead of trying to GROUP BY a date, it instead groups events by hour (or by half-hour increments if you want to support that timezone); which means your query will need to transfer at most 24 rows per calendar-day (or 48 rows for 30-min timezones) – your application code then can perform the final aggregation step to group-by-date (or if you want to do it in SQL you can, but profile+benchmark your results, you might be surprised at which approach is faster).

    MySQL supports indexing computed-columns just fine, btw.

    First, extend your online_speakers table with a new computed column that zeroes-out minutes, seconds, etc; i.e. round start_time down to its current hour (and do not round-up to the nearest hour): start_time_hour:

    ALTER TABLE online_speakers
        ADD COLUMN start_time_hour datetime GENERATED ALWAYS AS (
            DATEADD( DATE( start_time ), INTERVAL HOUR( start_time ) HOURS )
        ) STORED NOT NULL;
    
    CREATE INDEX ix_start_time_hour ON online_speakers ( start_time_hour );
    

    Then have a query like this:

    (Explanation below)

    DECLARE @rangeStartAsUtc   datetime = // Input parameter, derived from the user-input requested range-start calendar date, then converted to a UTC date+time value considering the requested timezone-id
    DECLARE @rangeEndExclAsUtc datetime = // Input parameter, exclusive upper-bound.
    DECLARE @userUcOffset      time     = // Input parameter, derived from the requested timezone-id, determined by the calling application.
        
    WITH rangedEventsCountByHour AS (
    
        SELECT
            s.start_time_hour, /* Defined above as `DATEADD( DATE( start_time ), INTERVAL HOUR( start_time ) HOURS )` */
            COUNT(*) AS cnt
        FROM
            online_speakers AS s
        WHERE
            s.start_time_hour >= @rangeStartAsUtc
            AND
            s.start_time_hour <  @rangeEndExclAsUtc
        GROUP BY
            start_time_hour
    ),
    rangedEventsCountByOffsetHours AS (
    
        SELECT
            DATEADD( s.start_time_hour, @userUcOffset ) AS start_time_user,
            cnt
        FROM
            rangedEventsCountByHour AS s
    ),
    groupedByOffsetDate AS (
    
        SELECT
            DATE( r.start_time_user ) AS start_date_user,
            SUM( r.cnt ) AS cnt
        FROM
            rangedEventsCountByHour AS r
        GROUP BY
            DATE( r.start_time_user )
    )
    SELECT
        g.start_date_user,
        g.cnt
    FROM
        groupedByOffsetDate AS g;
    
    • The query uses named CTEs instead of inner-queries; this is my preference because it makes queries more self-documenting.
    • This query should be fast, provided the INDEX ix_start_time_hour exists.
      • If you might be concerned that it will be slow because it’s so long or uses a chain of 4 CTEs, remember that SQL queries do not describe a sequence of instructions or any kind of steps to follow. The "depth" of a query is in now way an indication of its performance).

      • I appreciate that this does look unnecessarily verbose, and I know MySQL lets you use a SELECT‘s column-expressions in its GROUP BY clause, but I’m not a MySQL expert and I don’t know the details so my query above tries to be ISO SQL-compliant for the benefit of other readers.

      • I’ll also admit that this is untested: I don’t have access to a MySQL instance with enough suitable data to test this query with right now.

    1. Because the (computed-coumn) online_speakers.start_time_hour is indexed, it means when rangedEventsCountByHour runs, the RDBMS needs only a range-scan of that index and perform the COUNT(*) using the INDEX-alone (so it doesn’t need to even read the source TABLE online_speakers).
      • So this CTE will return the number of online_speakers rows per date+hour.
    2. The rangedEventsCountByOffsetHours CTE performs the conversion of UTC-based start_time_hour to the user’s own timezone.
      • This is done in its own named CTE because I wasn’t confident of how MySQL might handle non-matching SELECT exprs+agg vs. GROUP BY exprs clauses). Plus it makes the whole thing easier to understand, imo.
    3. The groupedByOffsetDate CTE is what converts it from user-hours to user-calendar-dates by performing another GROUP BY aggregation, but using SUM(cnt) not COUNT(*)` for obvious reasons.
      • Because the results from the previous step (rangedEventsCountByOffsetHours) and this step (groupedByOffsetDate) are dependent on variable input, it means these 2 steps can’t take further advantage of any INDEX or precomputed columns – so this part is inherently expensive, but remember that the first step (rangedEventsCountByHour) will return at most 24 rows per calendar day (and 0 rows for hours that had no events), so a query covering a whole month will mean 744, and 8,760 for a year – that’s a trivial amount for these CTEs to process.
    4. The final outer SELECT is just there because SQL doesn’t let you give a name to the outermost SELECT query for reasons I’ll never know.
    5. I’m not using ORDER BY here because in many (if not most?) cases it’s faster to sort the results in application-code; it also slows down a system because it means the RDBMS has to first buffer-up all results first, then sort them, before it can return sorted results to the client; but not using ORDER BY means the RDMBS can start returning rows to the client long before it’s finished reading/generating the full result-set.
    • Don’t forget that this query is for a single UTC offset, so you need to check if the date-range includes any DST start/end dates, or if any other changes happened to the zone’s UTC offset during that period – if so, then you’ll need to split the date-range up (at every DST or offset-change date) and run the query for each (uniformly UTC offset) run of dates.
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search