I have a couple of tables that looks like this:
CREATE TABLE Entities (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
client_id INT NOT NULL,
display_name VARCHAR(45),
PRIMARY KEY (id)
)
CREATE TABLE Statuses (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE EventTypes (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE Events (
id INT NOT NULL AUTO_INCREMENT,
entity_id INT NOT NULL,
date DATE NOT NULL,
event_type_id INT NOT NULL,
status_id INT NOT NULL
)
Events is large > 100,000,000 rows
Entities, Statuses and EventTypes are small < 300 rows a piece
I have several indexes on Events, but the ones that come into play are
idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_date_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)
I have a large complicated query, but I’m getting the same slow query results with a simpler one like the one below (note, in the real queries, I don’t use evt.*)
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN @start_date AND @end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = @type_id
For some reason, mysql keeps choosing the index which doesn’t cover Events.date and the query takes 15 seconds or more and returns a couple thousand rows. If I change the query to:
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt force index (idx_events_date_ent_status_type)
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN @start_date AND @end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = @type_id
The query takes .014 seconds.
Since this query is built by code, I would much rather not force the index, but mostly, I want to know why it chooses one index over the other. Is it because of the joins?
To give some stats, there are ~2500 distinct dates, and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low cardinality columns.
Do you think it would help to add date to the end of idx_events_date_ent_status_type? Since this is a large table, it takes a long time to add indexes.
I tried adding an additional index,
ix_events_ent_date_status_et(entity_id, date, status_id, event_type_id)
and it actually made the queries slower.
I will experiment a bit more, but I feel like I’m not sure how the optimizer makes it’s decisions.
Additional Info:
I tried removing the join to the Statuses table, and mysql switches to ix_events_date_ent_type, and the query runs in 0.045 sec
I can’t wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.
3
Answers
To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple where conditions. So assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id in (1,2,3), do:
If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:
I would add this index:
The order of columns is important. Put all column(s) used in equality conditions first. This is
event_type_id
in this case.The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.
Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than
=
orIS NULL
. So range conditions include>
,!=
,BETWEEN
,IN()
,LIKE
(with no leading wildcard),IS NOT NULL
, and so on.The condition on
entity_id
is also an equality condition if theIN()
list has one element. MySQL’s optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed ofIN (19)
is typical, then all three columns of the index will be used for filtering.It’s still worth putting
date
in the index, because it can at least tell the InnoDB storage engine to filter rows before returning them. See https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html It’s not quite as good as a real index lookup, but it’s worthwhile.I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.
Rearrange your indexes to have columns in this order:
=
orIS NULL
.IN
— If there is a single value, this will be further optimized to=
for you.date
.Note that nothing after a "range" test will be used by
WHERE
.(There are exceptions, but most are not relevant here.)
More discussion: Index Cookbook
Since the tables smell like Data Warehousing, I suggest looking into
Summary Tables In some cases, long queries on Events can be moved to the summary table(s), where they run much faster. Also, this may eliminate the need for some (or maybe even all) secondary indexes.
Since Events is rather large, I suggest using smaller numbers where practical.
INT
takes 4 bytes. Speed will improve slightly if you shrink those where appropriate.When you have
INDEX(a,b,c)
, that index will handle cases that needINDEX(a,b)
andINDEX(a)
. Keep the longer one. (Sometimes the Optimizer picks the shorter index ‘erroneously’.)