Mysql - How to properly utilize a composite index in JOIN queries?

Ivaylo
January 29, 2024
199 views
0 votes
2 Answers

My problem is how to make the MySQL optimizer use a composite index in the most effective way.

I am using MySQL Server 8 and have a MyISAM table containing daily statistics for ‘cell’ objects for a period of 2 years. For every day there are about 51.000 to 57.000 cells (rows). The columns in the table are many – about 860 counters. The database cannot be normalized, because all columns are equally important. The query produces about 840 columns of statistics for a cluster of user defined list of cells. Each column is a KPI, which is calculated based on one or more raw counters. The query joins a table with cluster definitions ‘clusters_cust’ with the main statistical table ‘h_cell’. Each cell from the cluster is matched with the statistical records in table ‘h_cell’ for the same cell. The user defines a period and the result is aggregated per cluster values for each day of the reporting period.

The query looks like following:

SELECT cluster,Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
INNER JOIN clusters_cust ON clusters_cust.cell = h_cell.cell
WHERE cluster='cluster62' AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time

Edit: As per TheImpaler’s comment:

You are joining two tables and applying filter on both of them. The optimizer doesn’t know if it’s better to start accessing table #1 and then scan table #2, or the other way around.

At the end of the question there are two modified variants of the query, which unfortunately either perform worse or the same.

Table ‘h_cell’ has following structure:

mysql> SHOW CREATE TABLE h_cell;
CREATE TABLE `h_cell` (
  `Time` date NOT NULL,
  `Cell` char(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
  `LocalCI` tinyint NOT NULL,
  `Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `counter1` int DEFAULT NULL,
  `counter2` int DEFAULT NULL,
  `counter3` double DEFAULT NULL,
  `counter4` float DEFAULT NULL,
  ...........
  `counter860` int DEFAULT NULL,
  PRIMARY KEY (`Cell`,`Time`) USING BTREE,
  KEY `Time` (`Time`,`LocalCI`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC

Table ‘clusters_cust’ has following structure:

mysql> SHOW CREATE TABLE clusters_cust;
CREATE TABLE `clusters_cust` (
  `Cell` varchar(11) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
  `Cluster` varchar(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `Comment` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  PRIMARY KEY (`Cluster`,`Cell`),
  KEY `Cell` (`Cell`),
  KEY `Comment` (`Comment`,`Cluster`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Indexes in ‘h_cell’ table:

mysql> SHOW INDEX FROM h_cell;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| h_cell |          0 | PRIMARY  |            1 | Cell        | A         |       58258 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          0 | PRIMARY  |            2 | Time        | A         |    39090988 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          1 | Time     |            1 | Time        | A         |         730 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          1 | Time     |            2 | LocalCI     | A         |       15081 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

The Primary Key is designed to serve queries on cell level – it shows statistics for a given cell and period. I was hoping that it will also help the join queries on cluster level, but this is not exactly the case.

Below is the EXPLAIN command for a cluster of 62 cells and 2 months reporting period. It seems that from the composite Primary Key only the first member ‘Cell’ is used and not the ‘Time’ part:

+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table         | partitions | type | possible_keys       | key     | key_len | ref                          | rows | filtered | Extra                                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | clusters_cust | NULL       | ref  | PRIMARY,Cell        | PRIMARY | 322     | const                        |   63 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | h_cell        | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | ee_4g_hua.clusters_cust.Cell |  671 |     5.32 | Using index condition                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+

For a larger cluster of 3.000 cells and reporting period of 2 months the situation is the same – again only the first member ‘Cell’ is used:

+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table         | partitions | type | possible_keys       | key     | key_len | ref                          | rows | filtered | Extra                                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | clusters_cust | NULL       | ref  | PRIMARY,Cell        | PRIMARY | 322     | const                        | 4067 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | h_cell        | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | ee_4g_hua.clusters_cust.Cell |  671 |     5.32 | Using index condition                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+

But for the same cluster of 3.000 cells and a shorther reporting period of 1 month the Primary Key is not used at all and another index ‘Time’ is used (this index was designed for another type of queries):

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 1056817 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+

For an even larger cluster of 20.000 cells and reporting period of 2 months again the Primary Key is not used, but the ‘Time’ index:

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 2080777 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+

I am wondering what should be changed either in my query or in the table design, so that the optimizer to be able to use both ‘Cell’ and ‘Time’ members of Primary key index? Is this possible at all or there should be another more effective index?

Edit:

Modified query No 1 – Instead of using JOIN I execute a subquery in WHERE clause to get the needed list of cells and then IN operator is used in the query result.

SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT cell FROM clusters_cust WHERE cluster='cluster20k') AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time

EXPLAIN show the same plan as if there was a JOIN and again ‘Time’ index was used, instead of PRIMARY KEY:

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 2080777 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+

Modified query No 2 – again JOIN is omitted and there is a subquery which directly gets the needed list of cells from a pre-filtered table ‘clust’. One more time IN operator is used.

SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT * FROM clust) and Time>='2018-05-01' and Time<='2018-06-30' 
GROUP BY Time

The result of EXPLAIN is below. This time PRIMARY KEY index was used, but only the first column from it – ‘cell’ and not both columns – ‘cell’ and ‘Time’. Execution time was so poor that I was not able to wait for query to end.

+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| id | select_type  | table       | partitions | type | possible_keys       | key     | key_len | ref              | rows  | filtered | Extra                        |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
|  1 | SIMPLE       | <subquery2> | NULL       | ALL  | NULL                | NULL    | NULL    | NULL             |  NULL |   100.00 | Using where; Using temporary |
|  1 | SIMPLE       | h_cell      | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | <subquery2>.cell |   671 |     5.32 | Using index condition        |
|  2 | MATERIALIZED | clust       | NULL       | ALL  | NULL                | NULL    | NULL    | NULL             | 20000 |   100.00 | NULL                         |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+

Tags: inner-join mysql

Answers

- DRapp
- January 29, 2024 at 1:55 pm
- 0 votes
0
Since alternatives have not worked, I would suggest this alternative.

on your h_cell table, have an index on ( time, cell ).

As for the query, I also added the keyword STRAIGHT_JOIN. I also qualified each column with the table (alias) to the corresponding columns to better follow which column comes from which table.

Now, from even your own description of data being daily of 51-57k records for a period of roughly 60 days is still running and computing over 300k records. Now, since you only care about ‘cluster62’ per this example, I would assume the count is less. Also, for clarity in the query, you should not have to add the tick characters around every column… table.column or alias.column gives the engine explicit qualification to prevent ambiguity of where data comes from, such as TIME may be a reserved word, but h.time is explicitly a column from the h (alias to h_cell table).
```
SELECT STRAIGHT_JOIN
        cc.cluster,
        h.Time,
        ROUND(  SUM( h.counter1 ) / SUM( h.counter1 + h.counter2 ) * 100,3) AS KPI1,
        SUM( h.counter1 ) AS KPI2,
        .......
        SUM( h.counterN ) AS KPI840
    FROM 
        h_cell h
            INNER JOIN clusters_cust cc
                ON  cc.cluster = 'cluster62' 
                AND h.cell = cc.cell
    WHERE 
            h.Time >= '2018-05-01' 
        AND h.Time <= '2018-06-30'
    GROUP BY 
        h.Time 
```
Login or Signup to reply.

Is this what you have?

Several "clusters"
Each cluster has several "cells"
Each cell have several rows in h_cell, and more come in over time.
The query needs to fetch the data for all the cells in one cluster.

Let’s restructure.

SHOW CREATE TABLE h_cell;
CREATE TABLE `h_cell` (
  `Time` date NOT NULL,
   ClusterId SMALLINT UNSIGNED NOT NULL,  -- added
   CellId MEDIUMINT UNSIGNED NOT NULL,    -- changed
  `LocalCI` tinyint NOT NULL,
  `Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `counter1` int DEFAULT NULL,
  `counter2` int DEFAULT NULL,
  `counter3` double DEFAULT NULL,
  `counter4` float DEFAULT NULL,
  ...........
  `counter860` int DEFAULT NULL,
  PRIMARY KEY (`ClusterId, `Time`, `CellId`) USING BTREE,  -- Changed
  KEY `Time` (`Time`,`LocalCI`) USING BTREE
  -- (other queries may need other indexes)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC

Notes/questions:

What are LocalCI and Integrity for?
Are you sure MySQL 8 kept the table as MyISAM?
If 8.0 silently switched the table to InnoDB (or you are forced to later), The PK will probably need changing (for efficiency).
Check the size of the Ids that I added.
Make appropriate changes for the "Ids" in the other tables.

(minor) If you frequently do monthly scans, consider changing from

    AND  hc.Time>='2018-05-01'
    AND  hc.Time<='2018-06-30'

to this formulation:

    AND  hc.Time>='2018-05-01'
    AND  hc.Time <'2018-05-01' + INTERVAL 1 MONTH

The principle here is to turn a lot of lookups into a simple index scan.

Resulting query:

SELECT  cc.cluster, hc.Time,
        ROUND ,,,
    FROM  clusters_cust AS cc
    JOIN  h_cell AS hc  ON cc.ClusterId = hc.ClusterId
    WHERE  cc.cluster='cluster62'
      AND  hc.Time>='2018-05-01'
      AND  hc.Time <'2018-05-01' + INTERVAL 1 MONTH
    GROUP BY  hc.Time

Please signup or login to give your own answer.

Click here to cancel reply.

Mysql – How to properly utilize a composite index in JOIN queries?

Answers