I am looking to optimize a JOIN between tables that uses a small subset of columns in the second table.
CREATE TABLE `employees` (
`employee_id` bigint NOT NULL,
`manager_id` bigint NOT NULL,
`org_id` bigint NOT NULL,
`union_id` bigint NOT NULL
...
PRIMARY KEY (employee_id),
INDEX (union_id)
);
CREATE TABLE `managers` (
`manager_id` bigint NOT NULL,
`org_id` bigint NOT NULL,
`some_condition` boolean NOT NULL,
PRIMARY KEY (manager_id)
);
Now I want to optimize two types of queries. Both join the tables together on manager_id and org_id, and optionally apply a filter to the some_condition column.
SELECT employees.*
FROM employees
JOIN managers
ON (employees.manager_id = managers.manager_id AND employees.org_id = managers.org_id)
WHERE (employees.union_id = ? AND managers.some_condition);
SELECT employees.*
FROM employees
JOIN managers
ON (employees.manager_id = managers.manager_id AND employees.org_id = managers.org_id)
WHERE (employees.union_id = ?);
Assuming these are very large tables and employees > manager. I am trying to create an index on managers that will speed up the query. Right now the query is slow because for each row it has to read org_id and some_condition. I want to avoid going to disk if possible.
So far I have two indexes that might work:
INDEX `join_index` (`org_id`,`some_condition`)
INDEX `id_join_index` (`manager_id`, `org_id`, `some_condition`)
My main issue is that MySQL does not use either index in the EXPLAIN statement unless I force it to with use index (…).
Which index (if either) will speed up my query, and do I need manager_id in the index to speed up the join if I do not filter on some_condition?
2
Answers
Here are my test results:
Which seem to be the same as yours. The "manager" table here uses the primary key index. Why doesn’t it use the secondary index we created? Because the secondary index only contains data of specified columns. Since the fields in your SQL Join include the primary key "manager_id" field of the "manager" table, if MySQL used the secondary index you created, it would still need to perform a table lookup to query the primary key index to obtain the "manager_id" field. Therefore, MySQL directly chose the primary key index of the "manager" table. Because the primary key index contains a complete row of records, including the "org_id" and "some_condition" fields, it only needs to compare and filter them one by one.
The Optimizer will decide which table to "start with". If it picks
employees
, then these indexes will be optimal:But, since
manager_id
is thePRIMARY KEY
, the above index intomanagers
is useless.The above indexes are reasonably good for either of your queries.
Adding the following may speed up the first query (if the optimizer would prefer to start with
managers
):Rationale:
WHERE
clause. Query 2’sWHERE
references onlyemployees
, so it will start with that table. Query 1 references both tables, so it will do some analysis to guestimate which table is better to start with.ON
andWHERE
(for the second table) are fair game — that is, they should probably be in a composite index.