CREATE TABLE `test` (
`id` bigint UNSIGNED NOT NULL AUTO_INCREMENT,
`month` timestamp NOT NULL DEFAULT '2018-01-01 00:00:00',
`b` varchar(255) NOT NULL DEFAULT '',
`c` varchar(255) NOT NULL DEFAULT '',
`d` varchar(255) NOT NULL DEFAULT '',
`e` varchar(255) NOT NULL DEFAULT '',
...
PRIMARY KEY (`id`),
INDEX `idx_month_b`(`month`, `b`) USING BTREE,
INDEX `idx_c`(`c`) USING BTREE,
...
);
Background
The table contains about 4 million records.
The data distribution on the month field is quite uneven, with more than half of the data concentrated after the year 2023.
The index cardinality of the month field is 80; the index cardinality of the c field is 2.65 million+.
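For reference, cardinality figures like these can be read from MySQL's index metadata; a minimal sketch (the values are sampled estimates, not exact counts):

SHOW INDEX FROM test;
-- The Cardinality column holds InnoDB's sampled estimate of distinct values
-- per index prefix; it can drift from reality until statistics are refreshed.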
Problem 1
select * from test where month>='2024-07-01' and month<='2024-07-31' and d=? order by id desc limit 10;
Under the condition month >= '2024-07-01' AND month <= '2024-07-31', there are actually 210,000 matching records.
The query actually returns an empty result set.
The execution plan shows that the primary key index was used, with rows=85. When I explicitly forced the (month, b) index, rows=440,000. I expected it to use the (month, b) index, and execution is in fact significantly faster with the forced index.
Why?
Why did the optimizer not use the expected index, and why is the estimated number of scanned rows so low when it uses the primary key?
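A minimal sketch to reproduce the comparison (the value for d is a placeholder; idx_month_b is the index name from the schema above):

EXPLAIN SELECT * FROM test
WHERE month >= '2024-07-01' AND month <= '2024-07-31' AND d = 'x'
ORDER BY id DESC LIMIT 10;
-- optimizer picks PRIMARY with rows=85: it scans backward by id,
-- apparently betting that 10 matching rows will turn up quickly

EXPLAIN SELECT * FROM test FORCE INDEX (idx_month_b)
WHERE month >= '2024-07-01' AND month <= '2024-07-31' AND d = 'x'
ORDER BY id DESC LIMIT 10;
-- rows=440,000, yet this plan is significantly faster in practice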
Problem 2
select * from test where c='test1' order by id desc limit 1;
select * from test where c='test2' order by id desc limit 1;
The index on column c has good selectivity; different c values match anywhere from 1 to 4,000 records.
Under the condition c='test1' there are 3,000 records, and under c='test2' there are 2,900 records.
Neither query returns an empty result.
For c='test1' the query used the index on c, while for c='test2' it used the primary key index, resulting in a significant performance difference.
Why?
Given that the data volumes are so close, what could cause the query with the smaller data volume to use the primary key index? My research suggests that under a single-column secondary index, the primary key values may not be completely ordered; is that right?
Why did it not use the expected index (c), and why is the estimated number of scanned rows so low when using the primary key?
After increasing the LIMIT, the optimizer switches to the expected index.
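One way to inspect the choice directly is the optimizer trace (a sketch; available in MySQL 5.6+, run within a single session):

SET optimizer_trace = 'enabled=on';
SELECT * FROM test WHERE c = 'test2' ORDER BY id DESC LIMIT 1;
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';
-- Look at "considered_execution_plans" and
-- "reconsidering_access_paths_for_index_ordering": the latter shows the
-- cost numbers behind swapping INDEX(c) for a backward PRIMARY scan.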
I discovered an interesting detail. In Problem 1, when the primary key index is used, gradually increasing the LIMIT increases the estimated rows proportionally: with limit=1, rows=8; with limit=2, rows=16; and so on, up to a certain threshold. I suspect that past this threshold the estimated rows for the primary key exceed the estimated scan rows for the index, after which the index is chosen. Additionally, I found that the estimated scan rows for the primary key equal the LIMIT multiplied by (total rows / estimated rows for index idx_month_b):
rows = LIMIT × (total rows / estimated rows for idx_month_b)
However, I couldn't find a clear explanation of how the estimated rows for index idx_month_b are calculated.
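Plugging in the numbers reported above (roughly 4,000,000 total rows, and the 440,000 rows estimated for idx_month_b in the forced plan), the formula lands close to the observed estimates:

rows(PRIMARY) ≈ LIMIT × total rows / rows(idx_month_b)
              ≈ 10 × 4,000,000 / 440,000
              ≈ 91    (the plan above showed rows=85)

For LIMIT 1 this gives about 9 against the observed rows=8; the small gap suggests the internal estimate for idx_month_b is nearer 500,000 than 440,000.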
2 Answers
Sorry, my question might not have been clear enough. My MySQL version is 5.7. For Problem 1, I understand how to fix the query; what I want to explore is the optimizer's basis for choosing the primary key index over idx_month_b. While exploring, I found the behavior described above: with the primary key index, the estimated rows grow in proportion to the LIMIT (limit=1 gives rows=8, limit=2 gives rows=16, and so on, up to a threshold), and the primary key's estimated scan rows equal LIMIT × (total rows / estimated rows for idx_month_b).
So I want to know how the estimated number of scanned rows for idx_month_b is obtained (beyond the extra cost of sorting and of the back-to-table lookups). When the data is uniformly distributed, rows = total rows / index cardinality, but here the distribution is not uniform. I also believe the optimizer considers the LIMIT when calculating costs; in my experiments, the LIMIT significantly affects the estimated number of scanned rows.
For Problem 2, my guess matches yours: the statistics are stale. However, the production environment does not allow running ANALYZE.
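If ANALYZE cannot be run in production, the persistent statistics InnoDB is currently using can at least be inspected read-only; a sketch against the standard statistics tables (MySQL 5.6+ with persistent stats enabled):

-- Per-index statistics; the n_diff_pfx* rows are the sampled cardinalities
SELECT index_name, stat_name, stat_value, stat_description
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE() AND table_name = 'test';

-- Table-level row estimate the optimizer works from
SELECT n_rows, clustered_index_size
FROM mysql.innodb_table_stats
WHERE database_name = DATABASE() AND table_name = 'test';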
Problem 1: Needs INDEX(d, month, id), in this order.
Note that month <= '2024-07-31' does not catch things on the last day of July, because the timestamp '2024-07-31' means midnight at the start of that day. Suggest: month < '2024-08-01'.
Problem 2: For c='test2' there are 2,900 records, but the query uses the PRIMARY key. Do ANALYZE TABLE test; to see if it starts using INDEX(c) (as it should). ANALYZE is rarely needed; here it sounds like the statistics had not been gathered "right".
Also, the Optimizer usually does not notice the existence of LIMIT when deciding what to do.
What version of MySQL? File a bug report if my answer is not sufficient: bugs.mysql.com .
The cardinality of month, and of the index starting with month, seem irrelevant to your 3 test cases.
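Putting the answer's two suggestions together; a sketch (idx_d_month_id is a name chosen here for illustration):

ALTER TABLE test ADD INDEX idx_d_month_id (d, month, id);

SELECT * FROM test
WHERE d = ?
  AND month >= '2024-07-01'
  AND month <  '2024-08-01'    -- half-open range, so July 31 is fully included
ORDER BY id DESC
LIMIT 10;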