skip to Main Content

My problem is how to make the MySQL optimizer use a composite index in the most effective way.

I am using MySQL Server 8 and have a MyISAM table containing daily statistics for ‘cell’ objects for a period of 2 years. For every day there are about 51.000 to 57.000 cells (rows). The columns in the table are many – about 860 counters. The database cannot be normalized, because all columns are equally important. The query produces about 840 columns of statistics for a cluster of user defined list of cells. Each column is a KPI, which is calculated based on one or more raw counters. The query joins a table with cluster definitions ‘clusters_cust’ with the main statistical table ‘h_cell’. Each cell from the cluster is matched with the statistical records in table ‘h_cell’ for the same cell. The user defines a period and the result is aggregated per cluster values for each day of the reporting period.

The query looks like following:

SELECT cluster,Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
INNER JOIN clusters_cust ON clusters_cust.cell = h_cell.cell
WHERE cluster='cluster62' AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time 

Edit: As per TheImpaler’s comment:

You are joining two tables and applying filter on both of them. The optimizer doesn’t know if it’s better to start accessing table #1 and then scan table #2, or the other way around.

At the end of the question there are two modified variants of the query, which unfortunately either perform worse or the same.

Table ‘h_cell’ has following structure:

mysql> SHOW CREATE TABLE h_cell;
CREATE TABLE `h_cell` (
  `Time` date NOT NULL,
  `Cell` char(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
  `LocalCI` tinyint NOT NULL,
  `Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `counter1` int DEFAULT NULL,
  `counter2` int DEFAULT NULL,
  `counter3` double DEFAULT NULL,
  `counter4` float DEFAULT NULL,
  ...........
  `counter860` int DEFAULT NULL,
  PRIMARY KEY (`Cell`,`Time`) USING BTREE,
  KEY `Time` (`Time`,`LocalCI`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC

Table ‘clusters_cust’ has following structure:

mysql> SHOW CREATE TABLE clusters_cust;
CREATE TABLE `clusters_cust` (
  `Cell` varchar(11) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
  `Cluster` varchar(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `Comment` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  PRIMARY KEY (`Cluster`,`Cell`),
  KEY `Cell` (`Cell`),
  KEY `Comment` (`Comment`,`Cluster`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Indexes in ‘h_cell’ table:

mysql> SHOW INDEX FROM h_cell;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| h_cell |          0 | PRIMARY  |            1 | Cell        | A         |       58258 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          0 | PRIMARY  |            2 | Time        | A         |    39090988 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          1 | Time     |            1 | Time        | A         |         730 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          1 | Time     |            2 | LocalCI     | A         |       15081 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

The Primary Key is designed to serve queries on cell level – it shows statistics for a given cell and period. I was hoping that it will also help the join queries on cluster level, but this is not exactly the case.

Below is the EXPLAIN command for a cluster of 62 cells and 2 months reporting period. It seems that from the composite Primary Key only the first member ‘Cell’ is used and not the ‘Time’ part:

+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table         | partitions | type | possible_keys       | key     | key_len | ref                          | rows | filtered | Extra                                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | clusters_cust | NULL       | ref  | PRIMARY,Cell        | PRIMARY | 322     | const                        |   63 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | h_cell        | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | ee_4g_hua.clusters_cust.Cell |  671 |     5.32 | Using index condition                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+

For a larger cluster of 3.000 cells and reporting period of 2 months the situation is the same – again only the first member ‘Cell’ is used:

+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table         | partitions | type | possible_keys       | key     | key_len | ref                          | rows | filtered | Extra                                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | clusters_cust | NULL       | ref  | PRIMARY,Cell        | PRIMARY | 322     | const                        | 4067 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | h_cell        | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | ee_4g_hua.clusters_cust.Cell |  671 |     5.32 | Using index condition                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+

But for the same cluster of 3.000 cells and a shorther reporting period of 1 month the Primary Key is not used at all and another index ‘Time’ is used (this index was designed for another type of queries):

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 1056817 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+

For an even larger cluster of 20.000 cells and reporting period of 2 months again the Primary Key is not used, but the ‘Time’ index:

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 2080777 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+

I am wondering what should be changed either in my query or in the table design, so that the optimizer to be able to use both ‘Cell’ and ‘Time’ members of Primary key index? Is this possible at all or there should be another more effective index?

Edit:

  • Modified query No 1 – Instead of using JOIN I execute a subquery in WHERE clause to get the needed list of cells and then IN operator is used in the query result.
SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT cell FROM clusters_cust WHERE cluster='cluster20k') AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time 

EXPLAIN show the same plan as if there was a JOIN and again ‘Time’ index was used, instead of PRIMARY KEY:

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 2080777 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
  • Modified query No 2 – again JOIN is omitted and there is a subquery which directly gets the needed list of cells from a pre-filtered table ‘clust’. One more time IN operator is used.
SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT * FROM clust) and Time>='2018-05-01' and Time<='2018-06-30' 
GROUP BY Time 

The result of EXPLAIN is below. This time PRIMARY KEY index was used, but only the first column from it – ‘cell’ and not both columns – ‘cell’ and ‘Time’. Execution time was so poor that I was not able to wait for query to end.

+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| id | select_type  | table       | partitions | type | possible_keys       | key     | key_len | ref              | rows  | filtered | Extra                        |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
|  1 | SIMPLE       | <subquery2> | NULL       | ALL  | NULL                | NULL    | NULL    | NULL             |  NULL |   100.00 | Using where; Using temporary |
|  1 | SIMPLE       | h_cell      | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | <subquery2>.cell |   671 |     5.32 | Using index condition        |
|  2 | MATERIALIZED | clust       | NULL       | ALL  | NULL                | NULL    | NULL    | NULL             | 20000 |   100.00 | NULL                         |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+

2

Answers


  1. Since alternatives have not worked, I would suggest this alternative.

    on your h_cell table, have an index on ( time, cell ).

    As for the query, I also added the keyword STRAIGHT_JOIN. I also qualified each column with the table (alias) to the corresponding columns to better follow which column comes from which table.

    Now, from even your own description of data being daily of 51-57k records for a period of roughly 60 days is still running and computing over 300k records. Now, since you only care about ‘cluster62’ per this example, I would assume the count is less. Also, for clarity in the query, you should not have to add the tick characters around every column… table.column or alias.column gives the engine explicit qualification to prevent ambiguity of where data comes from, such as TIME may be a reserved word, but h.time is explicitly a column from the h (alias to h_cell table).

    SELECT STRAIGHT_JOIN
            cc.cluster,
            h.Time,
            ROUND(  SUM( h.counter1 ) / SUM( h.counter1 + h.counter2 ) * 100,3) AS KPI1,
            SUM( h.counter1 ) AS KPI2,
            .......
            SUM( h.counterN ) AS KPI840
        FROM 
            h_cell h
                INNER JOIN clusters_cust cc
                    ON  cc.cluster = 'cluster62' 
                    AND h.cell = cc.cell
        WHERE 
                h.Time >= '2018-05-01' 
            AND h.Time <= '2018-06-30'
        GROUP BY 
            h.Time 
    
    Login or Signup to reply.
  2. Is this what you have?

    • Several "clusters"
    • Each cluster has several "cells"
    • Each cell have several rows in h_cell, and more come in over time.
    • The query needs to fetch the data for all the cells in one cluster.

    Let’s restructure.

    SHOW CREATE TABLE h_cell;
    CREATE TABLE `h_cell` (
      `Time` date NOT NULL,
       ClusterId SMALLINT UNSIGNED NOT NULL,  -- added
       CellId MEDIUMINT UNSIGNED NOT NULL,    -- changed
      `LocalCI` tinyint NOT NULL,
      `Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
      `counter1` int DEFAULT NULL,
      `counter2` int DEFAULT NULL,
      `counter3` double DEFAULT NULL,
      `counter4` float DEFAULT NULL,
      ...........
      `counter860` int DEFAULT NULL,
      PRIMARY KEY (`ClusterId, `Time`, `CellId`) USING BTREE,  -- Changed
      KEY `Time` (`Time`,`LocalCI`) USING BTREE
      -- (other queries may need other indexes)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC
    

    Notes/questions:

    • What are LocalCI and Integrity for?

    • Are you sure MySQL 8 kept the table as MyISAM?

    • If 8.0 silently switched the table to InnoDB (or you are forced to later), The PK will probably need changing (for efficiency).

    • Check the size of the Ids that I added.

    • Make appropriate changes for the "Ids" in the other tables.

    • (minor) If you frequently do monthly scans, consider changing from

          AND  hc.Time>='2018-05-01'
          AND  hc.Time<='2018-06-30'
      

      to this formulation:

          AND  hc.Time>='2018-05-01'
          AND  hc.Time <'2018-05-01' + INTERVAL 1 MONTH
      

    The principle here is to turn a lot of lookups into a simple index scan.

    Resulting query:

    SELECT  cc.cluster, hc.Time,
            ROUND ,,,
        FROM  clusters_cust AS cc
        JOIN  h_cell AS hc  ON cc.ClusterId = hc.ClusterId
        WHERE  cc.cluster='cluster62'
          AND  hc.Time>='2018-05-01'
          AND  hc.Time <'2018-05-01' + INTERVAL 1 MONTH
        GROUP BY  hc.Time 
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search