i need to get concurrent sessions for every minute of day from database with ~2 million rows. what is the best(fastest) way to get that data?
table contains session_id, session_start_time, session_end_time
example
CREATE TABLE `session_id_table` (
`session_id` bigint NOT NULL,
`session_start_time` datetime DEFAULT NULL,
`session_end_time` datetime DEFAULT NULL,
PRIMARY KEY (`session_id`),
UNIQUE KEY `session_id_UNIQUE` (`session_id`),
KEY `idx_session_id_table_session_id` (`session_id`),
KEY `time_1` (`session_start_time`,`session_end_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin;
insert into session_id_table(session_id, session_start_time, session_end_time) values
(1, '2024-09-05 10:00:23', '2024-09-05 10:15:54'),
(2, '2024-09-05 10:00:29', '2024-09-05 10:22:54'),
(3, '2024-09-05 10:01:23', '2024-09-05 10:26:54'),
(4, '2024-09-05 10:02:20', '2024-09-05 10:21:54'),
(5, '2024-09-05 10:02:23', '2024-09-05 10:20:54'),
(6, '2024-09-05 10:21:00', '2024-09-05 10:35:54')
result should be
'2024-09-05 10:01:00', 1
'2024-09-05 10:01:00', 2
'2024-09-05 10:02:00', 1
'2024-09-05 10:02:00', 2
'2024-09-05 10:02:00', 3
'2024-09-05 10:21:00', 2
'2024-09-05 10:21:00', 3
'2024-09-05 10:21:00', 4
'2024-09-05 10:21:00', 6
so for every minute of day it needs to show active sessions at that time.
first i tried to do join for small part of data for 2 days (~30k rows per day, 300-700 session_ids per minute)
select distinct date_format(a.session_start_time, '%Y-%c-%e %H:%i'), b.session_id
from session_id_table as a
left join
session_id_table as b
on date_format(a.session_start_time, '%Y-%c-%e %H:%i')>=b.session_start_time and date_format(a.session_start_time, '%Y-%c-%e %H:%i')<=b.session_end_time
and date(a.session_start_time)>'2024-09-05'
but query didnt execute even after 3 hours
2
Answers
wanted to do answer this 3 days ago, but when benchmarking i noticed weird results(that i posted in another thread).
anyways fastest way to get concurrent session id-s is through comparing timestamps that are in another table. comparing datetimes on join gives 20-35% slower results than with timestamp. indexing data that is used on join argument seems to slow down everything not on the edges of database(added first or added last).
compared to same query with datetime instead of timestamps
if we use CTE with timestamps:
we get these execution times
if we run same CTE query but just have join on datetime instead of timestamps we get these results:
when i started this thread with python+sql script i needed to run ~55 seconds for every hour of data in database, but i could run this concurrently for faster execution.
with answer from d r i got ~5 seconds execution time when i tested it, giving the query significant improvement, but i didnt test this extensively.
2 weeks after, i decided to benchmark new query after learning where i failed in OP, saw weird results, run d r query saw weird results, so i created new thread about that weirdness. overall indexes gave super fast results at beginning of the table, but much slower results at anything not at edge, so d r answer was slower than python+sql solution that i already had for over 90% of the table.
You’ll need to generate rows for every minuute in the day in order to get all active session ids in every minute. Using recursive ctes could be helpfull here:
Addition:
First, ( @asd asd ) I’d like to thank you for your correction of my initial typos – been working on something similar (not same) and did some copy-paste actions without (or with incomplete) corrections. Thanks.
It is important to say that I’m not, in any way, someone who has some special knowledge about mysql – quite the opposite, I’ve never done a single job using mysql – the most of my jobs were (and are) tied to Oracle. On the other hand – the rdbms engines, no metter how different they could be, have a good deal of common things…
The answer to the question in coments why is this code faster than the one from question – well, there are a fair few of the reasons which you can find listed below and check using the link to the fiddle at the end. The list is not sorted in any order of significance and should be taken as a collection of some general rules and practices that were used in this answer to make this thing working a bit faster:
EXPLAIN plans for every part/step of the code used in this answer could be seen in the fiddle here.