mysql index selection on large table

bpeikes
December 30, 2022
161 views
3 votes
3 Answers

I have a couple of tables that looks like this:

CREATE TABLE Entities (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   client_id INT NOT NULL,
   display_name VARCHAR(45),
   PRIMARY KEY (id)
)

CREATE TABLE Statuses (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   PRIMARY KEY (id)
)

CREATE TABLE EventTypes (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   PRIMARY KEY (id)
)

CREATE TABLE Events (
   id INT NOT NULL AUTO_INCREMENT,
   entity_id INT NOT NULL,
   date DATE NOT NULL,
   event_type_id INT NOT NULL,
   status_id INT NOT NULL
)

Events is large > 100,000,000 rows

Entities, Statuses and EventTypes are small < 300 rows a piece

I have several indexes on Events, but the ones that come into play are

idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_date_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)

I have a large complicated query, but I’m getting the same slow query results with a simpler one like the one below (note, in the real queries, I don’t use evt.*)

SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
   JOIN `Entities` ent ON evt.entity_id = ent.id
   JOIN `EventTypes` et ON evt.event_type_id = et.id
   JOIN `Statuses` s ON evt.status_id = s.id
WHERE
   evt.date BETWEEN @start_date AND @end_date AND
   evt.entity_id IN ( 19 ) AND -- this in clause is built by code
   evt.event_type_id = @type_id

For some reason, mysql keeps choosing the index which doesn’t cover Events.date and the query takes 15 seconds or more and returns a couple thousand rows. If I change the query to:

SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt force index (idx_events_date_ent_status_type)
   JOIN `Entities` ent ON evt.entity_id = ent.id
   JOIN `EventTypes` et ON evt.event_type_id = et.id
   JOIN `Statuses` s ON evt.status_id = s.id
WHERE
   evt.date BETWEEN @start_date AND @end_date AND
   evt.entity_id IN ( 19 ) AND -- this in clause is built by code
   evt.event_type_id = @type_id

The query takes .014 seconds.

Since this query is built by code, I would much rather not force the index, but mostly, I want to know why it chooses one index over the other. Is it because of the joins?

To give some stats, there are ~2500 distinct dates, and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low cardinality columns.

Do you think it would help to add date to the end of idx_events_date_ent_status_type? Since this is a large table, it takes a long time to add indexes.

I tried adding an additional index,
ix_events_ent_date_status_et(entity_id, date, status_id, event_type_id)
and it actually made the queries slower.

I will experiment a bit more, but I feel like I’m not sure how the optimizer makes it’s decisions.

Additional Info:

I tried removing the join to the Statuses table, and mysql switches to ix_events_date_ent_type, and the query runs in 0.045 sec

I can’t wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.

Tags: mysql

Answers

- ysth
- December 30, 2022 at 5:17 pm
- 0 votes
0
To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple where conditions. So assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id in (1,2,3), do:
```
select ...
from (select date('2022-12-01') date union all select date('2022-12-02') union all select date('2022-12-03')) dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
```
If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:
```
select ...
from dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
where dates.date between @start_date and @end_date
```
Login or Signup to reply.

- BillKarwin
- December 30, 2022 at 6:23 pm
- 0 votes
0
I would add this index:
```
ALTER TABLE Events ADD INDEX (event_type_id, entity_id, date);
```
The order of columns is important. Put all column(s) used in equality conditions first. This is event_type_id in this case.

The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.

Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than = or IS NULL. So range conditions include >, !=, BETWEEN, IN(), LIKE (with no leading wildcard), IS NOT NULL, and so on.

The condition on entity_id is also an equality condition if the IN() list has one element. MySQL’s optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed of IN (19) is typical, then all three columns of the index will be used for filtering.

It’s still worth putting date in the index, because it can at least tell the InnoDB storage engine to filter rows before returning them. See https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html It’s not quite as good as a real index lookup, but it’s worthwhile.

I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.
Login or Signup to reply.

- RickJames
- December 31, 2022 at 8:21 pm
- 0 votes
0
Rearrange your indexes to have columns in this order:
1. Any column(s) that will be tested with = or IS NULL.
2. Column(s) tested with IN — If there is a single value, this will be further optimized to = for you.
3. One "range" column, such as your date.
Note that nothing after a "range" test will be used by WHERE.

(There are exceptions, but most are not relevant here.)
- More discussion: Index Cookbook
- Since the tables smell like Data Warehousing, I suggest looking into
  Summary Tables In some cases, long queries on Events can be moved to the summary table(s), where they run much faster. Also, this may eliminate the need for some (or maybe even all) secondary indexes.
- Since Events is rather large, I suggest using smaller numbers where practical. INT takes 4 bytes. Speed will improve slightly if you shrink those where appropriate.
- When you have INDEX(a,b,c), that index will handle cases that need INDEX(a,b) and INDEX(a). Keep the longer one. (Sometimes the Optimizer picks the shorter index ‘erroneously’.)
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.