MySQL Where vs INNER JOIN performance

Flame
June 3, 2024
160 views
1 vote
2 Answers

I’d like to understand why the following 2 queries that select duplicates from the companies table have different execution times. The second query with the JOIN is executed much faster.

The query is executed on:

a companies table with ~200k records
of which there are 195k unique name column values (each query returns ~1500 results, namely the duplicates)
there is a nameindex index on the name column

WHERE (0.31142875 s):

SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);

JOIN (0.07034850 s):

SELECT *
FROM `companies`
INNER JOIN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
) AS `duplicate_names` ON `companies`.`name` = `duplicate_names`.`name`;

Note that the subqueries are exactly the same. Why is it that in this specific setup the second query is faster?

The output from EXPLAIN is:

WHERE query:

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	companies	NULL	ALL	NULL	NULL	NULL	NULL	195258	100.00	Using where
2	SUBQUERY	companies	NULL	index	nameindex	nameindex	1022	NULL	195258	100.00	Using index

JOIN query:

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY		NULL	ALL	NULL	NULL	NULL	NULL	195258	100.00	NULL
1	PRIMARY	companies	NULL	ref	nameindex	nameindex	1022	duplicate_names.name	1	100.00	NULL
2	DERIVED	companies	NULL	index	nameindex	nameindex	1022	NULL	195258	100.00	Using index

Tags: mysql sql

Answers

- OlegSmirnov
- June 3, 2024 at 5:07 am
- 0 votes
0
It looks like subquery materialization would be more efficient for the 1st statement, but the optimizer chooses SUBQUERY execution strategy.

Are the statistics collected? If not, try collecting them: ANALYZE TABLE companies and re-running the 1st statement.

If you’re on MySQL 8.0 or later you can use the optimizer hint:
```
SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT /*+ SUBQUERY(MATERIALIZATION) */ `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);
```
Login or Signup to reply.

In my opinion, the choice of an execution plan, as well as the preference of a particular plan, strongly depends on the selectivity of the index.
I tried to create a test table with a different number of duplicates and compared the execution plans and query performance on this table.

Results:
For low part of duplicates. In OP table 1500/300000=0.5%

Total rows	Duplicates	Join	In
15000	62 (0.4%)	0.01211000	0.02565750
15000	118	0.01559225	0.02836675
15000	133	0.01403475	0.02834625
15000	288	0.02084200	0.02859325
15000	1015	0.05302300	0.06872775
15000	1122	0.05069350	0.05169975
15000	2085	0.04678600	0.04377050
15000	3744	0.05224775	0.05224775

Plan for JOIN

-> Nested loop inner join  (cost=12028 rows=14888) (actual time=10.8..11.7 rows=124 loops=1)
    -> Table scan on duplicate_names  (cost=3843..3950 rows=8410) (actual time=10.7..10.8 rows=62 loops=1)
        -> Materialize  (cost=3843..3843 rows=8410) (actual time=10.7..10.7 rows=62 loops=1)
            -> Filter: (count(0) > 1)  (cost=3002 rows=8410) (actual time=0.0488..10.7 rows=62 loops=1)
                -> Group aggregate: count(0)  (cost=3002 rows=8410) (actual time=0.045..9.72 rows=14938 loops=1)
                    -> Covering index scan on companies using ix_name  (cost=1513 rows=14888) (actual time=0.0379..4.67 rows=15000 loops=1)
    -> Index lookup on companies using ix_name (name=duplicate_names.`name`)  (cost=0.443 rows=1.77) (actual time=0.014..0.0152 rows=2 loops=62)

For IN

-> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2))  (cost=1513 rows=14888) (actual time=11.1..37.2 rows=124 loops=1)
    -> Table scan on companies  (cost=1513 rows=14888) (actual time=0.0463..6.53 rows=15000 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((companies.`name` = `<materialized_subquery>`.`name`))  (cost=3843..3843 rows=1) (actual time=0.00161..0.00161 rows=0.00827 loops=15001)
            -> Limit: 1 row(s)  (cost=3843..3843 rows=1) (actual time=0.00146..0.00146 rows=0.00827 loops=15001)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`)  (actual time=0.00132..0.00132 rows=0.00827 loops=15001)
                    -> Materialize with deduplication  (cost=3843..3843 rows=8410) (actual time=10.9..10.9 rows=62 loops=1)
                        -> Filter: (count(0) > 1)  (cost=3002 rows=8410) (actual time=0.0351..10.8 rows=62 loops=1)
                            -> Group aggregate: count(0)  (cost=3002 rows=8410) (actual time=0.0314..9.89 rows=14938 loops=1)
                                -> Covering index scan on companies using ix_name  (cost=1513 rows=14888) (actual time=0.0265..4.67 rows=15000 loops=1)

While duplicate names part is large, plan is different.

JOIN

EXPLAIN
-> Nested loop inner join  (cost=13500 rows=5000) (actual time=10.5..31.4 rows=4972 loops=1)
    -> Table scan on companies  (cost=500 rows=5000) (actual time=0.0496..4.7 rows=5000 loops=1)
    -> Covering index lookup on duplicate_names using <auto_key0> (name=companies.`name`)  (cost=1001..1003 rows=10) (actual time=0.0046..0.00508 rows=0.994 loops=5000)
        -> Materialize  (cost=1000..1000 rows=1) (actual time=10.3..10.3 rows=967 loops=1)
            -> Filter: (count(0) > 1)  (cost=1000 rows=1) (actual time=0.0435..5.78 rows=967 loops=1)
                -> Group aggregate: count(0)  (cost=1000 rows=1) (actual time=0.0396..5.51 rows=995 loops=1)
                    -> Covering index scan on companies using ix_name  (cost=500 rows=5000) (actual time=0.0329..2.33 rows=5000 loops=1)

EXPLAIN
-> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2))  (cost=500 rows=5000) (actual time=8.01..28 rows=4972 loops=1)
    -> Table scan on companies  (cost=500 rows=5000) (actual time=0.0621..3.37 rows=5000 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((companies.`name` = `<materialized_subquery>`.`name`))  (cost=1000..1000 rows=1) (actual time=0.00407..0.00407 rows=0.994 loops=4994)
            -> Limit: 1 row(s)  (cost=1000..1000 rows=1) (actual time=0.00313..0.00313 rows=0.994 loops=4994)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`)  (actual time=0.0029..0.0029 rows=0.994 loops=4994)
                    -> Materialize with deduplication  (cost=1000..1000 rows=1) (actual time=7.79..7.79 rows=967 loops=1)
                        -> Filter: (count(0) > 1)  (cost=1000 rows=1) (actual time=0.107..7.02 rows=967 loops=1)
                            -> Group aggregate: count(0)  (cost=1000 rows=1) (actual time=0.103..6.82 rows=995 loops=1)
                                -> Covering index scan on companies using ix_name  (cost=500 rows=5000) (actual time=0.0936..3.33 rows=5000 loops=1)

Total rows	Duplicates	Join	In
10000	3125	0.05110900	0.03762375
5000	1816	0.02856050	0.02632475
5000	1794	0.08704400	0.02750950
5000	1862	0.06159625	0.03077825
5000	2844	0.06492925	0.03758150
5000	4003	0.05222750	0.05222750
5000	4007	0.06000325	0.04444525
5000	4007	0.04093250	0.03153725
5000	4004	0.02923175	0.04555700
5000	4005	0.04226600	0.03515675

The query execution time also depends on other factors, especially on such a small table of 5000 rows. But, nevertheless, the assumption that a particular execution plan is preferable to the selectivity of the data is confirmed.

Fiddle here

Test table

create table companies (id int primary key auto_increment
      , name  varchar(50) not null,Address varchar(100)
      ,LegalName varchar(100));
create index ix_name on companies (name);

Please signup or login to give your own answer.

Click here to cancel reply.