skip to Main Content

I have a vehicle table with many rows. Every vehicle is stored with a name, and then additional characterizing attributes. Let’s say

name PRIMARY KEY
attr_1
attr_2
attr_3
attr_4
etc...

The user can search via a web form by name and then add filters for each attribute if they wish.
I want to add some DB indexes to improve search performance for every use-case.

The user query could result into any combination of columns being added to the WHERE clause. For example:

SELECT * FROM vehicles WHERE name = ?
SELECT * FROM vehicles WHERE name = ? AND attr_1 = ?
SELECT * FROM vehicles WHERE name = ? AND attr_3 = ?
SELECT * FROM vehicles WHERE name = ? AND attr_1 = ? AND attr_2 = ? AND attr_4 = ?
SELECT * FROM vehicles WHERE name = ? AND attr_4 = ?
etc...

How should I best set up the indexes? As individual single column indexes? Or as multi-column indexes?

Do I need to cover all combinations, e.g.

Index for name, attr_1, attr_2, attr_3, attr_4
Index for name, attr_2, attr_3, attr_4
Index for name, attr_3, attr_4
Index for name, attr_4
Index for name, attr_1, attr_3, attr_4
etc...

Or do I analyze for which filter combinations happen most often and only target indexes for those use cases? I don’t know if there is some elegant solution for this problem?

2

Answers


  1. If you define multi-column indexes, they only help reduce the examined rows if the search matches the leftmost set of columns of the index.

    For example, suppose you have an index on (name, attr_1, attr_2, attr_3). Then someone does a search that skips attr_2:

    SELECT ... WHERE name = ? AND attr_1 = ? AND attr_3 = ?
    

    This will narrow down the rows by name and attr_1, but not by attr_3. That latter column is not contiguous with the other columns of the index being used for this query.

    This shows that you would need at least an index for every combination of attribute columns. This could quickly get out of hand, because InnoDB is limited to 64 secondary indexes per table.

    There are also issues with indexing when the conditions are not =, but inequality or range conditions. Then the order of columns in the index is important, so you end up needing every permutation of columns, which is even more.

    Even if you have fewer than 64 indexes, say 49 indexes, that’s too many indexes. Indexes multiple the cost of writes to the table, so when you insert/update/delete, it’ll be ~50 times slower than if you had no indexes (partially mitigated by the change buffer).

    do I analyze for which filter combinations happen most often and only target indexes for those use cases?

    That would be one approach for a compromise. It would help to improve performance for a subset of the queries users run. But other queries will not be optimized as well.

    It might be good enough for the index to target a specific subset of rows by name alone. Within that subset, the other conditions would be evaluated row-by-row, instead of optimized with indexes. But the subset it needs to examine might be sufficiently "pruned" that most users will be satisfied with the performance. It isn’t truly optimized as far as it could be, but maybe that’s not strictly required.

    Edit: What I mean is to create an index on name only. All queries whose conditions include WHERE name=? would use this index, but it would examine all rows for that name, evaluating any other conditions of the query and discarding those that don’t match the conditions.


    Ultimately, you end up with this conclusion: you can’t optimize everything.

    Allowing users to determine their own arbitrary search criteria doesn’t work well with B-tree indexes.

    The alternatives are:

    • Let some of your users suffer with poor performance if they choose an uncommon combination of conditions. Good luck explaining this to them.

    • Limit the users’ choices in your user interface, so they can choose only combinations that are supported by the indexes you created. They might not be happy with the limited choices, but all the choices they are given will be optimized.

    • Use a different indexing technology that allows arbitrary combinations of columns. The R-tree searches in Apache Lucene (or derivative products like Apache Solr, ElasticSearch, or CrateDB) can do this.

    Login or Signup to reply.
  2. Lump all the attribute values into a single TEXT column. Have a FULLTEXT index on that column. Then

    WHERE MATCH(that_column) AGAINST ("green 4wd")
    

    may come close to what you want, and do it very fast. There are a variety of issues.

    • 2-tone vehicle where one color is green will be included in the list.
    • there are syntax limitations of FULLTEXT’s definition of "word".

    Things with a "range" need to be handled by other means —

    WHERE price BETWEEN ... AND ...
      AND MATCH(col) AGAINST (...)
    

    will do the FT test first, then filter on price. It gets the desired result reasonably fast.

    (For a pure EAV, Bill’s answer is good. I discuss it in Entity-Attribute-Value .)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search