From the following table of sport matches:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat |
---|---|---|---|---|---|
852666 | 1 | 2 | 01/01/1997 | 1301 | 249 |
852842 | 1 | 2 | 13/01/1997 | 2837 | 2441 |
853471 | 2 | 1 | 05/05/1997 | 1474 | 952 |
4760 | 2 | 1 | 25/05/1998 | 1190 | 1486 |
6713 | 2 | 1 | 18/01/1999 | 2084 | 885 |
9365 | 2 | 1 | 01/11/1999 | 2894 | 2040 |
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 |
13022 | 1 | 2 | 14/08/2000 | 2722 | 2401 |
29159 | 1 | 2 | 26/08/2002 | 431 | 2769 |
44915 | 1 | 2 | 07/10/2002 | 1904 | 482 |
For a selected match id_
I’d like to return both players’ stats for their respective previous matches no matter whether the player in question was p1 or p2 in their last match. See below for the expected output for id_ = 11456
:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat | p1_prev_stat | p2_prev_stat |
---|---|---|---|---|---|---|---|
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 | 2040 | 2894 |
The following SQL works just fine on a table this size:
WITH cte_1 AS (
(
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
test.match_table
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
test.match_table
)
),
cte_2 AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
cte_1
)
SELECT
m.*,
cte_p1.prev_stat AS p1_prev_stat,
cte_p2.prev_stat AS p2_prev_stat
FROM
test.match_table AS m
JOIN cte_2 AS cte_p1 ON cte_p1.id_ = m.id_
AND cte_p1.player_id = m.p1_id
JOIN cte_2 AS cte_p2 ON cte_p2.id_ = m.id_
AND cte_p2.player_id = m.p2_id
WHERE m.id_ = 11456
ORDER BY m.match_date
However, the actual table is 1.3m rows which takes circa 12 seconds. From this answer the issue seems to be that the CTEs are loading all/most of the table rows rather than just those that are needed. However, the solution doesn’t cover this use case.
Would anyone have any suggestions as to how I might be able to improve this performance?
Here’s the SQL to create the small table:
CREATE TABLE `match_table` (
`id_` int NOT NULL AUTO_INCREMENT,
`p1_id` int NOT NULL,
`p2_id` int NOT NULL,
`match_date` date NOT NULL,
`p1_stat` int DEFAULT NULL,
`p2_stat` int DEFAULT NULL,
PRIMARY KEY (`id_`),
KEY `ix__p1_id` (`p1_id`),
KEY `ix__p2_id` (`p2_id`),
KEY `ix__match_date` (`match_date`),
KEY `ix__comp` (`p1_id`, `p2_id`, `match_date`)
);
INSERT INTO `match_table`
VALUES
(
4760, 2, 1, '1998-05-25', 1190, 1486
),
(6713, 2, 1, '1999-01-18', 2084, 885),
(
9365, 2, 1, '1999-11-01', 2894, 2040
),
(
11456, 1, 2, '2000-05-15', 2358, 1491
),
(
13022, 1, 2, '2000-08-14', 2722, 2401
),
(
29159, 1, 2, '2002-08-26', 431, 2769
),
(
44915, 1, 2, '2002-10-07', 1904, 482
),
(
852666, 1, 2, '1997-01-01', 1301, 249
),
(
852842, 1, 2, '1997-01-13', 2837, 2441
),
(
853471, 2, 1, '1997-05-05', 1474, 952
);
2
Answers
Thanks to @suchislife for getting me on the right track. The below ensures that the
UNION_ALL
CTE only processes the rows for the player IDs for the selected match. It runs in 0.015 seconds for a 1.3m row table :)First, having the right indexes on your database is key. You should make sure that
id_
,p1_id
,p2_id
,match_date
are indexed since they are used in your query.Second, instead of using two CTEs (Common Table Expressions), use the
LAG()
function directly in your main query. This removes the need to duplicate rows and can speed up your query. Here’s how you can do it:This query does the following:
LAG()
function directly in the main query to get previousstats.
However, if this doesn’t give you the performance you need, consider creating a summary table. This table would track each player’s stats for each match. Whenever a new match result is added, update the summary table. This requires more storage and affects write performance, but greatly improves read performance.
Update based on comment
One potential solution to speed it up is to only consider matches that occurred before the match you’re interested in. So, instead of looking at every single match, we only look at those that could contain relevant information. Something like this:
Here, we first select the match of interest. Then, we only look at matches that happened before this match. We collect stats for these earlier matches, and then join this with the match of interest to get the previous stats for the players in the match.
This might not be a big improvement if most matches occurred before the match of interest. It could be more effective to create an index on
match_date
, if one doesn’t already exist.