Mysql - Return values from previous row based on date ordered union CTE running slowly

Jossy
July 20, 2023
200 views
0 votes
2 Answers

From the following table of sport matches:

id_	p1_id	p2_id	match_date	p1_stat	p2_stat
852666	1	2	01/01/1997	1301	249
852842	1	2	13/01/1997	2837	2441
853471	2	1	05/05/1997	1474	952
4760	2	1	25/05/1998	1190	1486
6713	2	1	18/01/1999	2084	885
9365	2	1	01/11/1999	2894	2040
11456	1	2	15/05/2000	2358	1491
13022	1	2	14/08/2000	2722	2401
29159	1	2	26/08/2002	431	2769
44915	1	2	07/10/2002	1904	482

For a selected match id_ I’d like to return both players’ stats for their respective previous matches no matter whether the player in question was p1 or p2 in their last match. See below for the expected output for id_ = 11456:

id_	p1_id	p2_id	match_date	p1_stat	p2_stat	p1_prev_stat	p2_prev_stat
11456	1	2	15/05/2000	2358	1491	2040	2894

The following SQL works just fine on a table this size:

WITH cte_1 AS (
  (
    SELECT 
      id_, 
      match_date, 
      p1_id AS player_id, 
      p1_stat AS stat 
    FROM 
      test.match_table 
    UNION ALL 
    SELECT 
      id_, 
      match_date, 
      p2_id AS player_id, 
      p2_stat AS stat 
    FROM 
      test.match_table
  )
), 
cte_2 AS (
  SELECT 
    id_, 
    player_id, 
    LAG(stat) OVER (
      PARTITION BY player_id 
      ORDER BY 
        match_date, 
        id_
    ) AS prev_stat 
  FROM 
    cte_1
) 
SELECT 
  m.*, 
  cte_p1.prev_stat AS p1_prev_stat, 
  cte_p2.prev_stat AS p2_prev_stat 
FROM 
  test.match_table AS m 
  JOIN cte_2 AS cte_p1 ON cte_p1.id_ = m.id_ 
  AND cte_p1.player_id = m.p1_id 
  JOIN cte_2 AS cte_p2 ON cte_p2.id_ = m.id_ 
  AND cte_p2.player_id = m.p2_id
WHERE m.id_ = 11456
ORDER BY m.match_date

However, the actual table is 1.3m rows which takes circa 12 seconds. From this answer the issue seems to be that the CTEs are loading all/most of the table rows rather than just those that are needed. However, the solution doesn’t cover this use case.

Would anyone have any suggestions as to how I might be able to improve this performance?

Here’s the SQL to create the small table:

CREATE TABLE `match_table` (
  `id_` int NOT NULL AUTO_INCREMENT, 
  `p1_id` int NOT NULL, 
  `p2_id` int NOT NULL, 
  `match_date` date NOT NULL, 
  `p1_stat` int DEFAULT NULL, 
  `p2_stat` int DEFAULT NULL, 
  PRIMARY KEY (`id_`), 
  KEY `ix__p1_id` (`p1_id`), 
  KEY `ix__p2_id` (`p2_id`), 
  KEY `ix__match_date` (`match_date`), 
  KEY `ix__comp` (`p1_id`, `p2_id`, `match_date`)
);
INSERT INTO `match_table` 
VALUES 
  (
    4760, 2, 1, '1998-05-25', 1190, 1486
  ), 
  (6713, 2, 1, '1999-01-18', 2084, 885), 
  (
    9365, 2, 1, '1999-11-01', 2894, 2040
  ), 
  (
    11456, 1, 2, '2000-05-15', 2358, 1491
  ), 
  (
    13022, 1, 2, '2000-08-14', 2722, 2401
  ), 
  (
    29159, 1, 2, '2002-08-26', 431, 2769
  ), 
  (
    44915, 1, 2, '2002-10-07', 1904, 482
  ), 
  (
    852666, 1, 2, '1997-01-01', 1301, 249
  ), 
  (
    852842, 1, 2, '1997-01-13', 2837, 2441
  ), 
  (
    853471, 2, 1, '1997-05-05', 1474, 952
  );

Answers

Chosen as BEST ANSWER

Thanks to @suchislife for getting me on the right track. The below ensures that the UNION_ALL CTE only processes the rows for the player IDs for the selected match. It runs in 0.015 seconds for a 1.3m row table :)

WITH selected_match AS (
  SELECT * FROM match_table WHERE id_ = 11456
),
previous_matches AS (
  SELECT 
    m.id_, 
    m.match_date, 
    m.p1_id AS player_id, 
    m.p1_stat AS stat 
  FROM 
    match_table AS m
  JOIN selected_match AS sm ON sm.p1_id = m.p1_id OR sm.p1_id = m.p2_id
  UNION ALL 
  SELECT 
    m.id_, 
    m.match_date, 
    m.p2_id AS player_id, 
    m.p2_stat AS stat 
  FROM 
    match_table AS m
  JOIN selected_match AS sm ON sm.p2_id = m.p1_id OR sm.p2_id = m.p2_id
), 
previous_stats AS (
  SELECT 
    id_, 
    player_id, 
    LAG(stat) OVER (
      PARTITION BY player_id 
      ORDER BY 
        match_date, 
        id_
    ) AS prev_stat 
  FROM 
    previous_matches
) 
SELECT 
  m.*, 
  prev_stat_p1.prev_stat AS p1_prev_stat, 
  prev_stat_p2.prev_stat AS p2_prev_stat 
FROM 
  selected_match AS m 
  JOIN previous_stats AS prev_stat_p1 ON prev_stat_p1.id_ = m.id_ 
  AND prev_stat_p1.player_id = m.p1_id 
  JOIN previous_stats AS prev_stat_p2 ON prev_stat_p2.id_ = m.id_ 
  AND prev_stat_p2.player_id = m.p2_id
ORDER BY m.match_date

(Edit)

- suchislife
- July 18, 2023 at 9:12 am
- 0 votes
0
First, having the right indexes on your database is key. You should make sure that id_, p1_id, p2_id, match_date are indexed since they are used in your query.

Second, instead of using two CTEs (Common Table Expressions), use the LAG() function directly in your main query. This removes the need to duplicate rows and can speed up your query. Here’s how you can do it:
```
SELECT 
    m.*, 
    LAG(m.p1_stat) OVER (PARTITION BY m.p1_id ORDER BY m.match_date, m.id_) AS p1_prev_stat, 
    LAG(m.p2_stat) OVER (PARTITION BY m.p2_id ORDER BY m.match_date, m.id_) AS p2_prev_stat 
FROM 
    test.match_table AS m 
WHERE 
    m.id_ = 11456
ORDER BY 
    m.match_date;
```
This query does the following:
- Removes the union, which was doubling your row count.
- Skips the step of joining the original table twice.
- Applies the LAG() function directly in the main query to get previous
  stats.
However, if this doesn’t give you the performance you need, consider creating a summary table. This table would track each player’s stats for each match. Whenever a new match result is added, update the summary table. This requires more storage and affects write performance, but greatly improves read performance.

Update based on comment

One potential solution to speed it up is to only consider matches that occurred before the match you’re interested in. So, instead of looking at every single match, we only look at those that could contain relevant information. Something like this:
```
WITH selected_match AS (
  SELECT * FROM match_table WHERE id_ = 11456
),
previous_matches AS (
  SELECT 
    id_, 
    match_date, 
    p1_id AS player_id, 
    p1_stat AS stat 
  FROM 
    match_table
  WHERE
    match_date < (SELECT match_date FROM selected_match)
  UNION ALL 
  SELECT 
    id_, 
    match_date, 
    p2_id AS player_id, 
    p2_stat AS stat 
  FROM 
    match_table
  WHERE
    match_date < (SELECT match_date FROM selected_match)
), 
previous_stats AS (
  SELECT 
    id_, 
    player_id, 
    LAG(stat) OVER (
      PARTITION BY player_id 
      ORDER BY 
        match_date, 
        id_
    ) AS prev_stat 
  FROM 
    previous_matches
) 
SELECT 
  m.*, 
  prev_stat_p1.prev_stat AS p1_prev_stat, 
  prev_stat_p2.prev_stat AS p2_prev_stat 
FROM 
  selected_match AS m 
  JOIN previous_stats AS prev_stat_p1 ON prev_stat_p1.id_ = m.id_ 
  AND prev_stat_p1.player_id = m.p1_id 
  JOIN previous_stats AS prev_stat_p2 ON prev_stat_p2.id_ = m.id_ 
  AND prev_stat_p2.player_id = m.p2_id
ORDER BY m.match_date
```
Here, we first select the match of interest. Then, we only look at matches that happened before this match. We collect stats for these earlier matches, and then join this with the match of interest to get the previous stats for the players in the match.

This might not be a big improvement if most matches occurred before the match of interest. It could be more effective to create an index on match_date, if one doesn’t already exist.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Mysql – Return values from previous row based on date ordered union CTE running slowly

Answers

Update based on comment