skip to Main Content

i need to get concurrent sessions for every minute of day from database with ~2 million rows. what is the best(fastest) way to get that data?

table contains session_id, session_start_time, session_end_time

example

CREATE TABLE `session_id_table` (
  `session_id` bigint NOT NULL,
  `session_start_time` datetime DEFAULT NULL,
  `session_end_time` datetime DEFAULT NULL,
  PRIMARY KEY (`session_id`),
  UNIQUE KEY `session_id_UNIQUE` (`session_id`),
  KEY `idx_session_id_table_session_id` (`session_id`),
  KEY `time_1` (`session_start_time`,`session_end_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin;


insert into session_id_table(session_id, session_start_time, session_end_time) values 
(1, '2024-09-05 10:00:23', '2024-09-05 10:15:54'), 
(2, '2024-09-05 10:00:29', '2024-09-05 10:22:54'), 
(3, '2024-09-05 10:01:23', '2024-09-05 10:26:54'), 
(4, '2024-09-05 10:02:20', '2024-09-05 10:21:54'), 
(5, '2024-09-05 10:02:23', '2024-09-05 10:20:54'),
(6, '2024-09-05 10:21:00', '2024-09-05 10:35:54') 

result should be

'2024-09-05 10:01:00', 1 
'2024-09-05 10:01:00', 2 
'2024-09-05 10:02:00', 1
'2024-09-05 10:02:00', 2
'2024-09-05 10:02:00', 3
'2024-09-05 10:21:00', 2
'2024-09-05 10:21:00', 3
'2024-09-05 10:21:00', 4
'2024-09-05 10:21:00', 6

so for every minute of day it needs to show active sessions at that time.

first i tried to do join for small part of data for 2 days (~30k rows per day, 300-700 session_ids per minute)

select distinct date_format(a.session_start_time, '%Y-%c-%e %H:%i'), b.session_id 
from session_id_table as a
left join 
session_id_table as b
on date_format(a.session_start_time, '%Y-%c-%e %H:%i')>=b.session_start_time and date_format(a.session_start_time, '%Y-%c-%e %H:%i')<=b.session_end_time
and date(a.session_start_time)>'2024-09-05'

but query didnt execute even after 3 hours

2

Answers


  1. Chosen as BEST ANSWER

    wanted to do answer this 3 days ago, but when benchmarking i noticed weird results(that i posted in another thread).

    anyways fastest way to get concurrent session id-s is through comparing timestamps that are in another table. comparing datetimes on join gives 20-35% slower results than with timestamp. indexing data that is used on join argument seems to slow down everything not on the edges of database(added first or added last).

    select dt, session_id from datetime_timestamp as a
    inner join session_id_table as b 
    on (dt>=session_start_time) and (dt<=session_end_time)
    #where dt>='2023-05-25 00:00:00' and dt<='2023-05-25 01:00:00'
    where dt='2023-10-25 00:00:00' and dt<='2023-10-25 01:00:00'
    #where dt>='2023-12-10 00:00:00' and dt<='2023-12-10 01:00:00'
    #where dt>='2024-09-10 00:00:00' and dt<='2024-09-10 01:00:00'
    #where dt>='2024-09-20 12:00:00' and dt<='2024-09-20 13:00:00'
    

    timestamp without index compared to same query with datetime instead of timestamps datetime without index

    if we use CTE with timestamps:

    with 
    dates as
    (select distinct convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime) as date_time, unix_timestamp(convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime)) as ts
    from session_id_table)
    select date_time, session_id from dates as a
    inner join session_id_table as b 
    on (ts>=start_ts) and (ts<=end_ts)
    #where date_time>='2023-05-25 00:00:00' and date_time<='2023-05-25 01:00:00'
    where date_time>='2023-10-25 00:00:00' and date_time<='2023-10-25 01:00:00'
    #where date_time>='2023-12-10 00:00:00' and date_time<='2023-12-10 01:00:00'
    #where date_time>='2024-09-10 00:00:00' and date_time<='2024-09-10 01:00:00'
    #where date_time>='2024-09-19 00:00:00' and date_time<='2024-09-19 01:00:00'
    

    we get these execution times

    timestamps with CTE

    if we run same CTE query but just have join on datetime instead of timestamps we get these results: datetime with CTE

    when i started this thread with python+sql script i needed to run ~55 seconds for every hour of data in database, but i could run this concurrently for faster execution.

    with answer from d r i got ~5 seconds execution time when i tested it, giving the query significant improvement, but i didnt test this extensively.

    2 weeks after, i decided to benchmark new query after learning where i failed in OP, saw weird results, run d r query saw weird results, so i created new thread about that weirdness. overall indexes gave super fast results at beginning of the table, but much slower results at anything not at edge, so d r answer was slower than python+sql solution that i already had for over 90% of the table.


  2. You’ll need to generate rows for every minuute in the day in order to get all active session ids in every minute. Using recursive ctes could be helpfull here:

    --      S a m p l e    D a t a : 
    CREATE TABLE `session_id_table` (
      `session_id` bigint NOT NULL,
      `session_start_time` datetime DEFAULT NULL,
      `session_end_time` datetime DEFAULT NULL,
      PRIMARY KEY (`session_id`),
      UNIQUE KEY `session_id_UNIQUE` (`session_id`),
      KEY `idx_session_id_table_session_id` (`session_id`),
      KEY `time_1` (`session_start_time`,`session_end_time`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin;
    
    insert into session_id_table(session_id, session_start_time, session_end_time) values 
    (1, '2024-09-05 10:00:23', '2024-09-05 10:15:54'), 
    (2, '2024-09-05 10:00:29', '2024-09-05 10:22:54'), 
    (3, '2024-09-05 10:01:23', '2024-09-05 10:26:54'), 
    (4, '2024-09-05 10:02:20', '2024-09-05 10:21:54'), 
    (5, '2024-09-05 10:02:23', '2024-09-05 10:20:54'),
    (6, '2024-09-05 10:21:00', '2024-09-05 10:35:54');
    
    WITH    -- C T E (s)
      recursive log_hours  As
        ( Select 0 as hrs 
         Union All
          Select hrs + 1 as value
          From log_hours
          Where  log_hours.hrs < 23
        ),
      hrs_mins AS
        ( Select  Concat( LPAD(h.hrs, 2, '0'), ':', LPAD(m.mins, 2, '0') ) as hr_min 
          From    log_hours h
          Left  Join ( WITH 
                         recursive log_mins  As
                           ( Select  0 as mins 
                            Union All
                             Select  mins + 1 as value
                             From    log_mins
                             Where   log_mins.mins < 59
                           )
                       Select * From log_mins
                     ) m ON ( 1 = 1 )
          Order By  h.hrs, m.mins
        ), 
      dates AS
        ( Select   CONVERT( DATE_FORMAT(s.session_start_time,'%Y-%m-%d'), DATE ) as dt,
                   Min(CONVERT(DATE_FORMAT(s.session_start_time,'%Y-%m-%d-%H:%i:00'), DATETIME)) as min_dt_tm,
                   Max(CONVERT(DATE_FORMAT(s.session_end_time,'%Y-%m-%d-%H:%i:00'), DATETIME)) as max_dt_tm
          From     session_id_table s
          Group By CONVERT( DATE_FORMAT(s.session_start_time,'%Y-%m-%d'), DATE )
        )
    
    --      M a i n    S Q L : 
    Select       s.session_id, 
                 Date_Format(s.session_start_time, '%Y-%m-%d') as session_date,
                 hm.hr_min as session_hour_minute
    From         dates d
    Inner Join   hrs_mins hm ON( 1 = 1 )
    Inner Join   session_id_table s ON( CONVERT( DATE_FORMAT(s.session_start_time,'%Y-%m-%d'), DATE ) = d.dt And 
                                        hm.hr_min >= Date_Format(s.session_start_time, '%H:%i') And 
                                        hm.hr_min <= Date_Format(s.session_end_time, '%H:%i')
                                      )
    Where      CONVERT( Concat(DATE_FORMAT(d.dt,'%Y-%m-%d'), ' ', hm.hr_min), DATETIME ) >
                 ( Select min_dt_tm From dates Where dt = CONVERT(DATE_FORMAT(s.session_start_time, '%Y-%m-%d'), DATE) ) And
               CONVERT( Concat(DATE_FORMAT(d.dt,'%Y-%m-%d'), ' ', hm.hr_min), DATETIME ) <=
                 ( Select max_dt_tm From dates Where dt = CONVERT(DATE_FORMAT(s.session_start_time, '%Y-%m-%d'), DATE) )
    Order By   Date_Format(s.session_start_time, '%Y-%m-%d'), hm.hr_min, s.session_id
    
    /*      R e s u l t :
    session_id  session_date    session_hour_minute
    ----------  --------------  -------------------
             1  2024-09-05      10:01
             2  2024-09-05      10:01
             1  2024-09-05      10:02
             2  2024-09-05      10:02
             3  2024-09-05      10:02
    ......      .......         .....
             2  2024-09-05      10:16
             3  2024-09-05      10:16
             4  2024-09-05      10:16
             5  2024-09-05      10:16
    ......      .......         .....
             2  2024-09-05      10:22
             3  2024-09-05      10:22
             6  2024-09-05      10:22
             3  2024-09-05      10:23
             6  2024-09-05      10:23
    ......      .......         .....        */
    

    Addition:
    First, ( @asd asd ) I’d like to thank you for your correction of my initial typos – been working on something similar (not same) and did some copy-paste actions without (or with incomplete) corrections. Thanks.

    It is important to say that I’m not, in any way, someone who has some special knowledge about mysql – quite the opposite, I’ve never done a single job using mysql – the most of my jobs were (and are) tied to Oracle. On the other hand – the rdbms engines, no metter how different they could be, have a good deal of common things…

    The answer to the question in coments why is this code faster than the one from question – well, there are a fair few of the reasons which you can find listed below and check using the link to the fiddle at the end. The list is not sorted in any order of significance and should be taken as a collection of some general rules and practices that were used in this answer to make this thing working a bit faster:

    • the code is adjusted to use the indexes provided in OP’s create table statement
    • cte(s) used, with the exception of dates cte, are completely separated from the data – no tables were used (see the fiddle below)
    • recursion for hours and nested recursion for minutes generate a row for every minute in a day which is used as a predefined grid of rows for any posible date/day/period that exist in your data
    • code doesn’t depend on implicite datatype conversions – if there is a date comparison needed – it is compared to another date datatype explicitely – timestamp to another timestamp
    • all together, properly combined, made the code running a bit faster then the code from the question – the actual data from your table are processed with already prepared grid of all minutes in a day cross joined to all dates of interest with no conditions whatsoever ON(1=1).

    EXPLAIN plans for every part/step of the code used in this answer could be seen in the fiddle here.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search