A table (phototag) consists of two foreign key columns (photoid, tagid). I want to get the most related photos based on their tags.
There are 4 million photos with 5-10 tags. For example, the photoid 10009 has 6 tags. I need photos that have similar tags.
SELECT photoid
FROM phototag
WHERE photoid != 10009
AND tagid IN (21192, 3501, 35286, 21269, 16369, 48136)
GROUP BY photoid
ORDER BY COUNT(photoid) DESC
LIMIT 24;
Without ORDER BY COUNT
the query is super fast.
I tried but no result:
- Optimizing table
- making a primary key based on two columns
- indexing columns separately
- switching InnoDB to MyISAM
UPDATE
Here is the result with explain
prefix
4
Answers
Count queries are notoriously difficult to optimize. However, we should be able to index the
WHERE
clause. Consider trying the following compound index:This index covers the
WHERE
clause, as well as theSELECT
, and should be usable while aggregating withGROUP BY
.Another totally different approach would be to create a summary table. This table would take the following structure:
Every time you insert a new record into
phototag
which is not one of the blacklistedtagid
values, you would also update the summary table. For example, assuming you had just inserted a whitelisted tag forphotoid
10001, you would execute the following update:Now, with this summary table in hand, your original query just becomes:
Note that there is no aggregation needed. Instead, the aggregation is happening all along the way in the summary table.
Without ORDER BY clause, if the database finds 24 records that satisfy the condition, it may return the results before reading all table records.
With the ORDER BY clause, the database must read all records in order to sort them according to the count result.
The IN operator makes queries difficult to optimize. Database needs to repeatedly access the index as many times as the number of variables. Plan generators are easy to induce to do table scans in these situations.
I don’t think creating index is helpful. However,
phototag(tagid, photoid)
is better thanphototag(photoid, tagid)
because the database cannot use the index when the condition is negative.If you change what you mean by "similar", you can reduce the amount of work the database engine needs to do. (Mostly, avoiding sorting the "score" for all or most of you 4 million photos.
For example, you could pick all photos that share your target photo’s one "rarest" tag.
With indexes on both
(tagid, photoid)
and(photoid, tagid)
.This massively reduces the number of rows that need to be processed and makes that processing easier.
But…
It is not what you originally asked for, and it’s possible that it will return zero matching photos.
What it is…
An example of a different definition of "similar" that can reduce the amount of data being processed, leading to faster queries.
So, there’s your trade off…
Either:
A summary table would be best (cf Summary Tables ). But, without such:
and
Notes: Do not include a id; do use composite indexing.
I find that having the tag as a string, not an id simplifies the coding and possibly speeds up execution — rather than first having to find the tagids. Think of it this way: An
INT
is 4 bytes, but what is theAVG(LENGTH(tag))
? Possibly not much more than 4.Consider
FULLTEXT
:If, in the photo table (or a parallel table), you hade a
tags TEXT
column andFULLTEXT(tags)
, then this might work much faster:(Be cautious of min token length and stopwords.)
(It may sort better with
IN BOOLEAN MODE
and/orNATURAL LANGUAGE
.)