skip to Main Content

A table (phototag) consists of two foreign key columns (photoid, tagid). I want to get the most related photos based on their tags.

There are 4 million photos with 5-10 tags. For example, the photoid 10009 has 6 tags. I need photos that have similar tags.

SELECT photoid
FROM phototag
WHERE photoid != 10009
  AND tagid IN (21192, 3501, 35286, 21269, 16369, 48136)
GROUP BY photoid
ORDER BY COUNT(photoid) DESC
LIMIT 24;

Without ORDER BY COUNT the query is super fast.

I tried but no result:

  • Optimizing table
  • making a primary key based on two columns
  • indexing columns separately
  • switching InnoDB to MyISAM

UPDATE

Here is the result with explain prefix

enter image description here

4

Answers


  1. Count queries are notoriously difficult to optimize. However, we should be able to index the WHERE clause. Consider trying the following compound index:

    CREATE INDEX idx ON phototag (photoid, tagid);
    

    This index covers the WHERE clause, as well as the SELECT, and should be usable while aggregating with GROUP BY.

    Another totally different approach would be to create a summary table. This table would take the following structure:

    photoid | count
    10000   | 25
    10001   | 300
    ...     | ...
    

    Every time you insert a new record into phototag which is not one of the blacklisted tagid values, you would also update the summary table. For example, assuming you had just inserted a whitelisted tag for photoid 10001, you would execute the following update:

    UPDATE summary
    SET count = count + 1
    WHERE photoid = 10000;
    

    Now, with this summary table in hand, your original query just becomes:

    SELECT photoid
    FROM summary
    WHERE photoid != 10009
    ORDER BY count DESC
    LIMIT 24;
    

    Note that there is no aggregation needed. Instead, the aggregation is happening all along the way in the summary table.

    Login or Signup to reply.
  2. Without ORDER BY clause, if the database finds 24 records that satisfy the condition, it may return the results before reading all table records.
    With the ORDER BY clause, the database must read all records in order to sort them according to the count result.

    The IN operator makes queries difficult to optimize. Database needs to repeatedly access the index as many times as the number of variables. Plan generators are easy to induce to do table scans in these situations.

    I don’t think creating index is helpful. However, phototag(tagid, photoid) is better than phototag(photoid, tagid) because the database cannot use the index when the condition is negative.

    Login or Signup to reply.
  3. If you change what you mean by "similar", you can reduce the amount of work the database engine needs to do. (Mostly, avoiding sorting the "score" for all or most of you 4 million photos.

    For example, you could pick all photos that share your target photo’s one "rarest" tag.

    SELECT *
      FROM phototag
     WHERE photoid != 10009
       AND tagid = (
             SELECT tagid
               FROM phototag
              WHERE tagid IN (
                      SELECT tagid
                        FROM phototag
                       WHERE photoid = 10009
                    )
           GROUP BY tagid
           ORDER BY COUNT(*), tagid
              LIMIT 1
           )
    

    With indexes on both (tagid, photoid) and (photoid, tagid).

    This massively reduces the number of rows that need to be processed and makes that processing easier.

    But…

    It is not what you originally asked for, and it’s possible that it will return zero matching photos.

    What it is…

    An example of a different definition of "similar" that can reduce the amount of data being processed, leading to faster queries.

    So, there’s your trade off…

    Either:

    • you accept the slow performance of a query that reads and re-sorts pretty much the whole table
    • you redefine "similar" to something "easier to solve" but probably "less good"
    Login or Signup to reply.
  4. A summary table would be best (cf Summary Tables ). But, without such:

    SELECT photoid
        FROM phototag
        WHERE photoid != 10009
          AND tag IN ('this', 'that, ...)
        GROUP BY photoid
        ORDER BY COUNT(*) DESC
        LIMIT 24;
    

    and

    CREATE TABLE phototag (
        photoid INT UNSIGNED NOT NULL,
        tag VARCHAR(99),
        PRIMARY KEY(photoid, tag),
        INDEX(tag, photoid)
    ) ENGINE=InnoDB
    

    Notes: Do not include a id; do use composite indexing.

    I find that having the tag as a string, not an id simplifies the coding and possibly speeds up execution — rather than first having to find the tagids. Think of it this way: An INT is 4 bytes, but what is the AVG(LENGTH(tag))? Possibly not much more than 4.

    Consider FULLTEXT:

    If, in the photo table (or a parallel table), you hade a tags TEXT column and FULLTEXT(tags), then this might work much faster:

    SELECT photoid
        FROM photos
        WHERE    MATCH(tags) AGAINST("this that ...")
        ORDER BY MATCH(tags) AGAINST("this that ...") DESC
        HAVING photoid != 10009
        LIMIT 24;
    

    (Be cautious of min token length and stopwords.)

    (It may sort better with IN BOOLEAN MODE and/or NATURAL LANGUAGE.)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search