
I have a table named "article" containing 1,145,141 records with various fields including

"id," "uiqPID," "tenant," "status," "title," "body," "user_id," "category_id," "created_at," and "updated_at."

The "status" column can have one of three values: ‘PUBLISHED,’ ‘DRAFT,’ or ‘ARCHIVED,’ with the following counts:

  • PUBLISHED: 2
  • DRAFT: 26,145
  • ARCHIVED: 1,118,993

I have set up the following indexes:

  1. PRIMARY index on (id)
  2. article_abstract_unq_id unique index on (uiqPID, tenant)
  3. article_status_idx index on (status)
  4. idx_composite_search index on (id, uiqPID, created_at)

My issue is with the performance of the following query, which took 5.7 seconds to execute:

SELECT 
    a.id AS id,
    a.created_at AS created_at
FROM 
    article a
WHERE 
    a.status = 'ARCHIVED'
ORDER BY a.created_at DESC
LIMIT 50;

However, if I remove the WHERE condition or change it to a.status = ‘DRAFT’, the query completes within 1 second.

Upon inspecting the query plan, I noticed the difference in execution strategies. With the ‘ARCHIVED’ or ‘DRAFT’ status filter, the plan shows:

key: article_status_idx
Extra: Using index condition; Using filesort

But without the WHERE condition, the plan simply states:

key: NULL
Extra: Using filesort

My question is: How can I optimize the query performance for filtering on ‘ARCHIVED’ status, ensuring it executes faster than the current 5.7 seconds, similar to the queries without this condition or with ‘DRAFT’ status?

3 Answers


  1. The behavior you experienced is actually what I would expect: writing 1,118,993 (id, created_at) tuples into memory takes time, especially if you order the results by created_at. You need to test the following:

    SELECT 
        a.id AS id,
        a.created_at AS created_at
    FROM 
        article a
    ORDER BY a.created_at DESC
    LIMIT 0, 1118993;
    

    If the time to execute this query is roughly the same as the query with the WHERE clause, then it is not the WHERE clause that causes the performance drop, but the ordering plus loading the results into memory. Run the following as well:

    SELECT 
        a.id AS id,
        a.created_at AS created_at
    FROM 
        article a
    LIMIT 0, 1118993;
    

    Here we do not even have an ORDER BY. If this is similarly slow, then what you are mainly waiting for is writing all this data into memory.

    Either way, run these measurements; the result will either confirm that the slowness is caused by the WHERE or ORDER BY clause, or rule them out. If those clauses are the cause, let me know in the comment section and I will suggest ways to optimize them. But it is highly probable that the real problem is simply waiting for the full result set to be materialized.

    It may also be better to break the query into chunks, running queries with LIMIT 0, 10000, then LIMIT 10000, 10000, and so on. That way you can work with the first results while waiting for the rest, which may reduce the frustration of impatient users. It does not change the fact that loading all this data takes time, but showing partial results can make the wait friendlier if partial results are useful.
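    As a sketch, the chunked version of the query could look like this (the chunk size of 10,000 is arbitrary; note that OFFSET-based paging rereads all skipped rows, so later chunks get progressively slower):

    -- first chunk
    SELECT a.id, a.created_at
    FROM article a
    WHERE a.status = 'ARCHIVED'
    ORDER BY a.created_at DESC
    LIMIT 0, 10000;

    -- second chunk, and so on
    SELECT a.id, a.created_at
    FROM article a
    WHERE a.status = 'ARCHIVED'
    ORDER BY a.created_at DESC
    LIMIT 10000, 10000;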

    EDIT

    If all else fails, you could create a table like this:

    create table article_archive(
        id int primary key,
        created_at timestamp
    );
    
    insert into article_archive(id, created_at)
    select id, created_at
    from article
    where `status` = 'ARCHIVED';
    
    ALTER TABLE article_archive ADD INDEX (created_at DESC);
    

    And then you can select from article_archive, like:

    select id, created_at
    from article_archive
    order by created_at desc
    limit 50;
    

    You can even join that with article by id.
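    For example, a sketch of such a join, pulling a further column like title from the full table:

    SELECT aa.created_at, a.title
    FROM article_archive aa
    JOIN article a ON a.id = aa.id
    ORDER BY aa.created_at DESC
    LIMIT 50;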

    Of course, you can try creating a (status, created_at) key before you create a copy table.
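    A sketch of that, to try before resorting to the copy table (the index name idx_status_created is arbitrary):

    ALTER TABLE article ADD INDEX idx_status_created (status, created_at DESC);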

  2. Here’s an alternative approach that may improve speed, try it:

    SELECT 
        a.id AS id,
        a.created_at AS created_at
    FROM (
        SELECT id
        FROM article
        WHERE status = 'ARCHIVED'
        ORDER BY created_at DESC
        LIMIT 50
    ) AS subquery
    JOIN article a ON subquery.id = a.id;
    

    Explanation:
    This query uses a subquery to first identify the id values of the 50 most recent ‘ARCHIVED’ records based on the created_at column, then joins back to the article table by id to fetch the row data. Keeping the subquery narrow (selecting only id) means the filter and sort handle less data per row, which can help queries that combine filtering and sorting.

    For this to be fast, an index must actually serve the filter-and-sort access pattern; check that article_status_idx on the status column and idx_composite_search on the created_at and id columns cover it.

    If this answer helps, an upvote would be appreciated.

  3. You can create a composite index matching the query exactly:

    create index idx on article (status, created_at desc);
    

    This way, the DBMS can jump directly to status = ‘ARCHIVED’ in the index, read the first 50 entries, and be done with the task.

    https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html
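    To verify, run EXPLAIN on the original query after creating the index. Assuming InnoDB, the secondary index implicitly includes the primary key id, so the query should be fully covered: the plan should show this index in the key column and "Using index" in Extra, with no filesort.

    EXPLAIN SELECT a.id, a.created_at
    FROM article a
    WHERE a.status = 'ARCHIVED'
    ORDER BY a.created_at DESC
    LIMIT 50;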
