I have a vehicle table with many rows. Every vehicle is stored with a name plus additional characterizing attributes, say:
name PRIMARY KEY
attr_1
attr_2
attr_3
attr_4
etc...
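A minimal sketch of that table, assuming VARCHAR types for the attributes (the actual types don't matter for the indexing question):

CREATE TABLE vehicles (
  name   VARCHAR(100) PRIMARY KEY,
  attr_1 VARCHAR(50),
  attr_2 VARCHAR(50),
  attr_3 VARCHAR(50),
  attr_4 VARCHAR(50)
  -- etc...
);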
The user can search via a web form by name and then add filters for each attribute if they wish.
I want to add some DB indexes to improve search performance for every use case.
A user query can result in any combination of columns appearing in the WHERE clause. For example:
SELECT * FROM vehicles WHERE name = ?
SELECT * FROM vehicles WHERE name = ? AND attr_1 = ?
SELECT * FROM vehicles WHERE name = ? AND attr_3 = ?
SELECT * FROM vehicles WHERE name = ? AND attr_1 = ? AND attr_2 = ? AND attr_4 = ?
SELECT * FROM vehicles WHERE name = ? AND attr_4 = ?
etc...
How should I best set up the indexes: as individual single-column indexes, or as multi-column indexes?
Do I need to cover all combinations, e.g.
Index for name, attr_1, attr_2, attr_3, attr_4
Index for name, attr_2, attr_3, attr_4
Index for name, attr_3, attr_4
Index for name, attr_4
Index for name, attr_1, attr_3, attr_4
etc...
Or should I analyze which filter combinations occur most often and create indexes only for those use cases? Is there an elegant solution to this problem?
2 Answers
If you define multi-column indexes, they help reduce the examined rows only if the search conditions match a leftmost prefix of the index's columns.
For example, suppose you have an index on (name, attr_1, attr_2, attr_3), and someone runs a search that skips attr_2. That query will narrow down the rows by name and attr_1, but not by attr_3: the latter column is not contiguous with the other columns of the index being used for this query.

This shows that you would need at least an index for every combination of attribute columns. That could quickly get out of hand, because InnoDB is limited to 64 secondary indexes per table.
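A minimal sketch of that situation (the index name idx_attrs is an assumption):

ALTER TABLE vehicles ADD INDEX idx_attrs (name, attr_1, attr_2, attr_3);

-- Uses only the (name, attr_1) prefix of idx_attrs; because attr_2 is
-- absent from the conditions, attr_3 cannot narrow the index scan and
-- is checked against each fetched row instead.
SELECT * FROM vehicles WHERE name = ? AND attr_1 = ? AND attr_3 = ?;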
There are also issues with indexing when the conditions are not =, but inequality or range conditions. Then the order of columns in the index matters, so you end up needing every permutation of columns, which is even more.

Even if you stay below 64 indexes, say at 49, that's still too many. Indexes multiply the cost of writes to the table, so each insert/update/delete would be roughly 50 times slower than if you had no indexes at all (partially mitigated by the change buffer).
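To illustrate the range problem, a sketch under the same assumptions (idx_range is a hypothetical index name):

ALTER TABLE vehicles ADD INDEX idx_range (name, attr_1, attr_2);

-- The range on attr_1 ends the index's narrowing: attr_2 can no longer
-- reduce the scanned index entries, only filter the rows that come back.
SELECT * FROM vehicles WHERE name = ? AND attr_1 > ? AND attr_2 = ?;

Putting attr_1 last, i.e. INDEX(name, attr_2, attr_1), would fix this particular query but break others, which is why permutations start to matter.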
Analyzing which filter combinations happen most often and targeting indexes at only those use cases, as you suggest, would be one compromise. It would improve performance for a subset of the queries users run, but other queries will not be optimized as well.
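For instance, if a (hypothetical) query log showed that attr_1 alone and the attr_3/attr_4 pair dominate, you might settle for:

ALTER TABLE vehicles
  ADD INDEX idx_name_a1 (name, attr_1),
  ADD INDEX idx_name_a3_a4 (name, attr_3, attr_4);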
It might be good enough for the index to target a specific subset of rows by name alone. Within that subset, the other conditions would be evaluated row by row instead of optimized with indexes. But the subset to be examined might be sufficiently "pruned" that most users will be satisfied with the performance. It isn't optimized as far as it could be, but maybe that's not strictly required.

Edit: what I mean is to create an index on name only. All queries whose conditions include WHERE name = ? would use this index, but each would examine all rows for that name, evaluating the query's other conditions and discarding the rows that don't match.

Ultimately, you end up with this conclusion: you can't optimize everything.
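A sketch of that single index (idx_name is an assumed name; note that the question declares name as the PRIMARY KEY, and if name is truly unique this access path already exists and each name matches at most one row):

ALTER TABLE vehicles ADD INDEX idx_name (name);

-- The index narrows the scan to the rows for this name; the attr_2
-- condition is then evaluated row by row within that subset.
SELECT * FROM vehicles WHERE name = ? AND attr_2 = ?;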
Allowing users to determine their own arbitrary search criteria doesn’t work well with B-tree indexes.
The alternatives are:
Let some of your users suffer with poor performance if they choose an uncommon combination of conditions. Good luck explaining this to them.
Limit the users’ choices in your user interface, so they can choose only combinations that are supported by the indexes you created. They might not be happy with the limited choices, but all the choices they are given will be optimized.
Use a different indexing technology that allows arbitrary combinations of columns. The inverted-index searches in Apache Lucene (or derivative products like Apache Solr, Elasticsearch, or CrateDB) can do this.
Lump all the attribute values into a single TEXT column and put a FULLTEXT index on that column. Then a MATCH ... AGAINST search may come close to what you want, and do it very fast. There are a variety of issues, though.

Things with a "range" need to be handled by other means: a query of the form WHERE MATCH ... AGAINST ... AND price BETWEEN ? AND ? will do the FT test first, then filter on price. It gets the desired result reasonably fast.
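A hedged sketch of that idea; the attrs column, the index name, the keywords, and the price column (taken from the sentence above) are all assumptions:

ALTER TABLE vehicles ADD COLUMN attrs TEXT;
ALTER TABLE vehicles ADD FULLTEXT INDEX ft_attrs (attrs);

-- The FULLTEXT index resolves the keyword test first; the range
-- condition then filters those matches:
SELECT * FROM vehicles
 WHERE MATCH(attrs) AGAINST('+red +4wd' IN BOOLEAN MODE)
   AND price BETWEEN ? AND ?;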
(For a pure EAV design, Bill's answer is good. I discuss it further in Entity-Attribute-Value.)