skip to Main Content

I’d like to understand why the following 2 queries that select duplicates from the companies table have different execution times. The second query with the JOIN is executed much faster.

The query is executed on:

  • a companies table with ~200k records
  • of which there are 195k unique name column values (each query returns ~1500 results, namely the duplicates)
  • there is a nameindex index on the name column

WHERE (0.31142875 s):

SELECT *
FROM `companies`
WHERE `name` IN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
);

JOIN (0.07034850 s):

SELECT *
FROM `companies`
INNER JOIN (
    SELECT `name`
    FROM `companies`
    GROUP BY `name`
    HAVING COUNT(`name`) > 1
) AS `duplicate_names` ON `companies`.`name` = `duplicate_names`.`name`;  

Note that the subqueries are exactly the same. Why is it that in this specific setup the second query is faster?

The output from EXPLAIN is:

WHERE query:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY companies NULL ALL NULL NULL NULL NULL 195258 100.00 Using where
2 SUBQUERY companies NULL index nameindex nameindex 1022 NULL 195258 100.00 Using index

JOIN query:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY NULL ALL NULL NULL NULL NULL 195258 100.00 NULL
1 PRIMARY companies NULL ref nameindex nameindex 1022 duplicate_names.name 1 100.00 NULL
2 DERIVED companies NULL index nameindex nameindex 1022 NULL 195258 100.00 Using index

2

Answers


  1. It looks like subquery materialization would be more efficient for the 1st statement, but the optimizer chooses SUBQUERY execution strategy.

    Are the statistics collected? If not, try collecting them: ANALYZE TABLE companies and re-running the 1st statement.

    If you’re on MySQL 8.0 or later you can use the optimizer hint:

    SELECT *
    FROM `companies`
    WHERE `name` IN (
        SELECT /*+ SUBQUERY(MATERIALIZATION) */ `name`
        FROM `companies`
        GROUP BY `name`
        HAVING COUNT(`name`) > 1
    );
    
    Login or Signup to reply.
  2. In my opinion, the choice of an execution plan, as well as the preference of a particular plan, strongly depends on the selectivity of the index.
    I tried to create a test table with a different number of duplicates and compared the execution plans and query performance on this table.

    Results:
    For low part of duplicates. In OP table 1500/300000=0.5%

    Total rows Duplicates Join In
    15000 62 (0.4%) 0.01211000 0.02565750
    15000 118 0.01559225 0.02836675
    15000 133 0.01403475 0.02834625
    15000 288 0.02084200 0.02859325
    15000 1015 0.05302300 0.06872775
    15000 1122 0.05069350 0.05169975
    15000 2085 0.04678600 0.04377050
    15000 3744 0.05224775 0.05224775

    Plan for JOIN

    -> Nested loop inner join  (cost=12028 rows=14888) (actual time=10.8..11.7 rows=124 loops=1)
        -> Table scan on duplicate_names  (cost=3843..3950 rows=8410) (actual time=10.7..10.8 rows=62 loops=1)
            -> Materialize  (cost=3843..3843 rows=8410) (actual time=10.7..10.7 rows=62 loops=1)
                -> Filter: (count(0) > 1)  (cost=3002 rows=8410) (actual time=0.0488..10.7 rows=62 loops=1)
                    -> Group aggregate: count(0)  (cost=3002 rows=8410) (actual time=0.045..9.72 rows=14938 loops=1)
                        -> Covering index scan on companies using ix_name  (cost=1513 rows=14888) (actual time=0.0379..4.67 rows=15000 loops=1)
        -> Index lookup on companies using ix_name (name=duplicate_names.`name`)  (cost=0.443 rows=1.77) (actual time=0.014..0.0152 rows=2 loops=62)
    

    For IN

    -> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2))  (cost=1513 rows=14888) (actual time=11.1..37.2 rows=124 loops=1)
        -> Table scan on companies  (cost=1513 rows=14888) (actual time=0.0463..6.53 rows=15000 loops=1)
        -> Select #2 (subquery in condition; run only once)
            -> Filter: ((companies.`name` = `<materialized_subquery>`.`name`))  (cost=3843..3843 rows=1) (actual time=0.00161..0.00161 rows=0.00827 loops=15001)
                -> Limit: 1 row(s)  (cost=3843..3843 rows=1) (actual time=0.00146..0.00146 rows=0.00827 loops=15001)
                    -> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`)  (actual time=0.00132..0.00132 rows=0.00827 loops=15001)
                        -> Materialize with deduplication  (cost=3843..3843 rows=8410) (actual time=10.9..10.9 rows=62 loops=1)
                            -> Filter: (count(0) > 1)  (cost=3002 rows=8410) (actual time=0.0351..10.8 rows=62 loops=1)
                                -> Group aggregate: count(0)  (cost=3002 rows=8410) (actual time=0.0314..9.89 rows=14938 loops=1)
                                    -> Covering index scan on companies using ix_name  (cost=1513 rows=14888) (actual time=0.0265..4.67 rows=15000 loops=1)
    

    While duplicate names part is large, plan is different.

    JOIN

    EXPLAIN
    -> Nested loop inner join  (cost=13500 rows=5000) (actual time=10.5..31.4 rows=4972 loops=1)
        -> Table scan on companies  (cost=500 rows=5000) (actual time=0.0496..4.7 rows=5000 loops=1)
        -> Covering index lookup on duplicate_names using <auto_key0> (name=companies.`name`)  (cost=1001..1003 rows=10) (actual time=0.0046..0.00508 rows=0.994 loops=5000)
            -> Materialize  (cost=1000..1000 rows=1) (actual time=10.3..10.3 rows=967 loops=1)
                -> Filter: (count(0) > 1)  (cost=1000 rows=1) (actual time=0.0435..5.78 rows=967 loops=1)
                    -> Group aggregate: count(0)  (cost=1000 rows=1) (actual time=0.0396..5.51 rows=995 loops=1)
                        -> Covering index scan on companies using ix_name  (cost=500 rows=5000) (actual time=0.0329..2.33 rows=5000 loops=1)
    

    IN

    EXPLAIN
    -> Filter: <in_optimizer>(companies.`name`,companies.`name` in (select #2))  (cost=500 rows=5000) (actual time=8.01..28 rows=4972 loops=1)
        -> Table scan on companies  (cost=500 rows=5000) (actual time=0.0621..3.37 rows=5000 loops=1)
        -> Select #2 (subquery in condition; run only once)
            -> Filter: ((companies.`name` = `<materialized_subquery>`.`name`))  (cost=1000..1000 rows=1) (actual time=0.00407..0.00407 rows=0.994 loops=4994)
                -> Limit: 1 row(s)  (cost=1000..1000 rows=1) (actual time=0.00313..0.00313 rows=0.994 loops=4994)
                    -> Index lookup on <materialized_subquery> using <auto_distinct_key> (name=companies.`name`)  (actual time=0.0029..0.0029 rows=0.994 loops=4994)
                        -> Materialize with deduplication  (cost=1000..1000 rows=1) (actual time=7.79..7.79 rows=967 loops=1)
                            -> Filter: (count(0) > 1)  (cost=1000 rows=1) (actual time=0.107..7.02 rows=967 loops=1)
                                -> Group aggregate: count(0)  (cost=1000 rows=1) (actual time=0.103..6.82 rows=995 loops=1)
                                    -> Covering index scan on companies using ix_name  (cost=500 rows=5000) (actual time=0.0936..3.33 rows=5000 loops=1)
    
    Total rows Duplicates Join In
    10000 3125 0.05110900 0.03762375
    5000 1816 0.02856050 0.02632475
    5000 1794 0.08704400 0.02750950
    5000 1862 0.06159625 0.03077825
    5000 2844 0.06492925 0.03758150
    5000 4003 0.05222750 0.05222750
    5000 4007 0.06000325 0.04444525
    5000 4007 0.04093250 0.03153725
    5000 4004 0.02923175 0.04555700
    5000 4005 0.04226600 0.03515675

    The query execution time also depends on other factors, especially on such a small table of 5000 rows. But, nevertheless, the assumption that a particular execution plan is preferable to the selectivity of the data is confirmed.

    Fiddle here

    Test table

    create table companies (id int primary key auto_increment
          , name  varchar(50) not null,Address varchar(100)
          ,LegalName varchar(100));
    create index ix_name on companies (name);
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search