I’d like to understand why the following 2 queries that select duplicates from the companies
table have different execution times. The second query with the JOIN
is executed much faster.
The query is executed on:
- a
companies
table with ~200k records - of which there are 195k unique
name
column values (each query returns ~1500 results, namely the duplicates) - there is a
nameindex
index on thename
column
WHERE (0.31142875 s):
SELECT *
FROM `companies`
WHERE `name` IN (
SELECT `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
);
JOIN (0.07034850 s):
SELECT *
FROM `companies`
INNER JOIN (
SELECT `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
) AS `duplicate_names` ON `companies`.`name` = `duplicate_names`.`name`;
Note that the subqueries are exactly the same. Why is it that in this specific setup the second query is faster?
The output from EXPLAIN
is:
WHERE query:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | companies | NULL | ALL | NULL | NULL | NULL | NULL | 195258 | 100.00 | Using where |
2 | SUBQUERY | companies | NULL | index | nameindex | nameindex | 1022 | NULL | 195258 | 100.00 | Using index |
JOIN query:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | NULL | ALL | NULL | NULL | NULL | NULL | 195258 | 100.00 | NULL | |
1 | PRIMARY | companies | NULL | ref | nameindex | nameindex | 1022 | duplicate_names.name | 1 | 100.00 | NULL |
2 | DERIVED | companies | NULL | index | nameindex | nameindex | 1022 | NULL | 195258 | 100.00 | Using index |
2
Answers
It looks like subquery materialization would be more efficient for the 1st statement, but the optimizer chooses
SUBQUERY
execution strategy.Are the statistics collected? If not, try collecting them:
ANALYZE TABLE companies
and re-running the 1st statement.If you’re on MySQL 8.0 or later you can use the optimizer hint:
In my opinion, the choice of an execution plan, as well as the preference of a particular plan, strongly depends on the selectivity of the index.
I tried to create a test table with a different number of duplicates and compared the execution plans and query performance on this table.
Results:
For low part of duplicates. In OP table 1500/300000=0.5%
Plan for
JOIN
For
IN
While duplicate names part is large, plan is different.
JOIN
IN
The query execution time also depends on other factors, especially on such a small table of 5000 rows. But, nevertheless, the assumption that a particular execution plan is preferable to the selectivity of the data is confirmed.
Fiddle here
Test table