ORDER BY COUNT is slow - Mysql

Ulugbek
February 1, 2023
251 views
4 votes
4 Answers

A table (phototag) consists of two foreign key columns (photoid, tagid). I want to get the most related photos based on their tags.

There are 4 million photos with 5-10 tags. For example, the photoid 10009 has 6 tags. I need photos that have similar tags.

SELECT photoid
FROM phototag
WHERE photoid != 10009
  AND tagid IN (21192, 3501, 35286, 21269, 16369, 48136)
GROUP BY photoid
ORDER BY COUNT(photoid) DESC
LIMIT 24;

Without ORDER BY COUNT the query is super fast.

I tried but no result:

Optimizing table
making a primary key based on two columns
indexing columns separately
switching InnoDB to MyISAM

UPDATE

Here is the result with explain prefix

Answers

- TimBiegeleisen
- February 1, 2023 at 8:06 am
- 0 votes
0
Count queries are notoriously difficult to optimize. However, we should be able to index the WHERE clause. Consider trying the following compound index:
```
CREATE INDEX idx ON phototag (photoid, tagid);
```
This index covers the WHERE clause, as well as the SELECT, and should be usable while aggregating with GROUP BY.

Another totally different approach would be to create a summary table. This table would take the following structure:
```
photoid | count
10000   | 25
10001   | 300
...     | ...
```
Every time you insert a new record into phototag which is not one of the blacklisted tagid values, you would also update the summary table. For example, assuming you had just inserted a whitelisted tag for photoid 10001, you would execute the following update:
```
UPDATE summary
SET count = count + 1
WHERE photoid = 10000;
```
Now, with this summary table in hand, your original query just becomes:
```
SELECT photoid
FROM summary
WHERE photoid != 10009
ORDER BY count DESC
LIMIT 24;
```
Note that there is no aggregation needed. Instead, the aggregation is happening all along the way in the summary table.
Login or Signup to reply.

- lunalcni
- February 1, 2023 at 8:47 am
- 0 votes
0
Without ORDER BY clause, if the database finds 24 records that satisfy the condition, it may return the results before reading all table records.
With the ORDER BY clause, the database must read all records in order to sort them according to the count result.

The IN operator makes queries difficult to optimize. Database needs to repeatedly access the index as many times as the number of variables. Plan generators are easy to induce to do table scans in these situations.

I don’t think creating index is helpful. However, phototag(tagid, photoid) is better than phototag(photoid, tagid) because the database cannot use the index when the condition is negative.

Login or Signup to reply.

- MatBailie
- February 1, 2023 at 10:56 am
- 0 votes
0
If you change what you mean by "similar", you can reduce the amount of work the database engine needs to do. (Mostly, avoiding sorting the "score" for all or most of you 4 million photos.

For example, you could pick all photos that share your target photo’s one "rarest" tag.
```
SELECT *
  FROM phototag
 WHERE photoid != 10009
   AND tagid = (
         SELECT tagid
           FROM phototag
          WHERE tagid IN (
                  SELECT tagid
                    FROM phototag
                   WHERE photoid = 10009
                )
       GROUP BY tagid
       ORDER BY COUNT(*), tagid
          LIMIT 1
       )
```
With indexes on both (tagid, photoid) and (photoid, tagid).

This massively reduces the number of rows that need to be processed and makes that processing easier.

But…

It is not what you originally asked for, and it’s possible that it will return zero matching photos.

What it is…

An example of a different definition of "similar" that can reduce the amount of data being processed, leading to faster queries.

So, there’s your trade off…

Either:
- you accept the slow performance of a query that reads and re-sorts pretty much the whole table
- you redefine "similar" to something "easier to solve" but probably "less good"
Login or Signup to reply.

- RickJames
- February 1, 2023 at 9:30 pm
- 0 votes
0
A summary table would be best (cf Summary Tables ). But, without such:
```
SELECT photoid
    FROM phototag
    WHERE photoid != 10009
      AND tag IN ('this', 'that, ...)
    GROUP BY photoid
    ORDER BY COUNT(*) DESC
    LIMIT 24;
```
and
```
CREATE TABLE phototag (
    photoid INT UNSIGNED NOT NULL,
    tag VARCHAR(99),
    PRIMARY KEY(photoid, tag),
    INDEX(tag, photoid)
) ENGINE=InnoDB
```
Notes: Do not include a id; do use composite indexing.

I find that having the tag as a string, not an id simplifies the coding and possibly speeds up execution — rather than first having to find the tagids. Think of it this way: An INT is 4 bytes, but what is the AVG(LENGTH(tag))? Possibly not much more than 4.

Consider FULLTEXT:

If, in the photo table (or a parallel table), you hade a tags TEXT column and FULLTEXT(tags), then this might work much faster:
```
SELECT photoid
    FROM photos
    WHERE    MATCH(tags) AGAINST("this that ...")
    ORDER BY MATCH(tags) AGAINST("this that ...") DESC
    HAVING photoid != 10009
    LIMIT 24;
```
(Be cautious of min token length and stopwords.)

(It may sort better with IN BOOLEAN MODE and/or NATURAL LANGUAGE.)
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

ORDER BY COUNT is slow – Mysql

Answers