My problem is how to make the MySQL optimizer use a composite index in the most effective way.
I am using MySQL Server 8 and have a MyISAM table containing daily statistics for ‘cell’ objects for a period of 2 years. For every day there are about 51.000 to 57.000 cells (rows). The columns in the table are many – about 860 counters. The database cannot be normalized, because all columns are equally important. The query produces about 840 columns of statistics for a cluster of user defined list of cells. Each column is a KPI, which is calculated based on one or more raw counters. The query joins a table with cluster definitions ‘clusters_cust’ with the main statistical table ‘h_cell’. Each cell from the cluster is matched with the statistical records in table ‘h_cell’ for the same cell. The user defines a period and the result is aggregated per cluster values for each day of the reporting period.
The query looks like following:
SELECT cluster,Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
INNER JOIN clusters_cust ON clusters_cust.cell = h_cell.cell
WHERE cluster='cluster62' AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time
Edit: As per TheImpaler’s comment:
You are joining two tables and applying filter on both of them. The optimizer doesn’t know if it’s better to start accessing table #1 and then scan table #2, or the other way around.
At the end of the question there are two modified variants of the query, which unfortunately either perform worse or the same.
Table ‘h_cell’ has following structure:
mysql> SHOW CREATE TABLE h_cell;
CREATE TABLE `h_cell` (
`Time` date NOT NULL,
`Cell` char(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
`LocalCI` tinyint NOT NULL,
`Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
`counter1` int DEFAULT NULL,
`counter2` int DEFAULT NULL,
`counter3` double DEFAULT NULL,
`counter4` float DEFAULT NULL,
...........
`counter860` int DEFAULT NULL,
PRIMARY KEY (`Cell`,`Time`) USING BTREE,
KEY `Time` (`Time`,`LocalCI`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC
Table ‘clusters_cust’ has following structure:
mysql> SHOW CREATE TABLE clusters_cust;
CREATE TABLE `clusters_cust` (
`Cell` varchar(11) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
`Cluster` varchar(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
`Comment` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
PRIMARY KEY (`Cluster`,`Cell`),
KEY `Cell` (`Cell`),
KEY `Comment` (`Comment`,`Cluster`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Indexes in ‘h_cell’ table:
mysql> SHOW INDEX FROM h_cell;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| h_cell | 0 | PRIMARY | 1 | Cell | A | 58258 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 0 | PRIMARY | 2 | Time | A | 39090988 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 1 | Time | 1 | Time | A | 730 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 1 | Time | 2 | LocalCI | A | 15081 | NULL | NULL | | BTREE | | | YES | NULL |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
The Primary Key is designed to serve queries on cell level – it shows statistics for a given cell and period. I was hoping that it will also help the join queries on cluster level, but this is not exactly the case.
Below is the EXPLAIN command for a cluster of 62 cells and 2 months reporting period. It seems that from the composite Primary Key only the first member ‘Cell’ is used and not the ‘Time’ part:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | clusters_cust | NULL | ref | PRIMARY,Cell | PRIMARY | 322 | const | 63 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | ee_4g_hua.clusters_cust.Cell | 671 | 5.32 | Using index condition |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
For a larger cluster of 3.000 cells and reporting period of 2 months the situation is the same – again only the first member ‘Cell’ is used:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | clusters_cust | NULL | ref | PRIMARY,Cell | PRIMARY | 322 | const | 4067 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | ee_4g_hua.clusters_cust.Cell | 671 | 5.32 | Using index condition |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
But for the same cluster of 3.000 cells and a shorther reporting period of 1 month the Primary Key is not used at all and another index ‘Time’ is used (this index was designed for another type of queries):
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 1056817 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
For an even larger cluster of 20.000 cells and reporting period of 2 months again the Primary Key is not used, but the ‘Time’ index:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 2080777 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
I am wondering what should be changed either in my query or in the table design, so that the optimizer to be able to use both ‘Cell’ and ‘Time’ members of Primary key index? Is this possible at all or there should be another more effective index?
Edit:
- Modified query No 1 – Instead of using JOIN I execute a subquery in
WHERE
clause to get the needed list of cells and thenIN
operator is used in the query result.
SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT cell FROM clusters_cust WHERE cluster='cluster20k') AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time
EXPLAIN
show the same plan as if there was a JOIN
and again ‘Time’ index was used, instead of PRIMARY KEY
:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 2080777 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
- Modified query No 2 – again
JOIN
is omitted and there is a subquery which directly gets the needed list of cells from a pre-filtered table ‘clust’. One more timeIN
operator is used.
SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT * FROM clust) and Time>='2018-05-01' and Time<='2018-06-30'
GROUP BY Time
The result of EXPLAIN
is below. This time PRIMARY KEY
index was used, but only the first column from it – ‘cell’ and not both columns – ‘cell’ and ‘Time’. Execution time was so poor that I was not able to wait for query to end.
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | <subquery2>.cell | 671 | 5.32 | Using index condition |
| 2 | MATERIALIZED | clust | NULL | ALL | NULL | NULL | NULL | NULL | 20000 | 100.00 | NULL |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
2
Answers
Since alternatives have not worked, I would suggest this alternative.
on your h_cell table, have an index on ( time, cell ).
As for the query, I also added the keyword STRAIGHT_JOIN. I also qualified each column with the table (alias) to the corresponding columns to better follow which column comes from which table.
Now, from even your own description of data being daily of 51-57k records for a period of roughly 60 days is still running and computing over 300k records. Now, since you only care about ‘cluster62’ per this example, I would assume the count is less. Also, for clarity in the query, you should not have to add the
tick
characters around every column… table.column or alias.column gives the engine explicit qualification to prevent ambiguity of where data comes from, such asTIME
may be a reserved word, buth.time
is explicitly a column from theh
(alias to h_cell table).Is this what you have?
h_cell
, and more come in over time.Let’s restructure.
Notes/questions:
What are LocalCI and Integrity for?
Are you sure MySQL 8 kept the table as MyISAM?
If 8.0 silently switched the table to InnoDB (or you are forced to later), The PK will probably need changing (for efficiency).
Check the size of the Ids that I added.
Make appropriate changes for the "Ids" in the other tables.
(minor) If you frequently do monthly scans, consider changing from
to this formulation:
The principle here is to turn a lot of lookups into a simple index scan.
Resulting query: