skip to Main Content

Found Problem:

I was able to significantly reduce the response time to less than 0.01 seconds from 1.1 seconds simply by removing the ORDER BY clause.

I was doing a USORT in PHP on dates anyway since the post-processing would reorder the posts by type of match, so the ORDER BY clause was unnecessary.

The query described by Larry Lustig is very fast for integers. It was the ordering by a string dates which caused significant performance problems.

Hope this servers as a good example of how to apply conditions to multiple rows and why you want to watch out for database operations involving strings.

Original Question (Edited for Clarity):

I’m working on an indexed search with keyword stemming. I’m looking to optimize it more.

Larry Lustig’s answer helped me a lot here on how to create the query for easy scaling:

SQL for applying conditions to multiple rows in a join

Any suggestions of how to optimize this example query to run faster? I am guessing there could be a way to do one join for all the "s{x).post_id = p.ID" conditions.

SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date 
FROM wp_posts AS p
WHERE p.post_status = 'publish' AND ( ( ( 
    ( EXISTS ( SELECT s0.post_id FROM wp_rama_search_index AS s0 
        WHERE s0.post_id = p.ID AND s0.hash = -617801035 ) )
) OR ( 
    ( EXISTS ( SELECT s1.post_id FROM wp_rama_search_index AS s1 
            WHERE s1.post_id = p.ID AND s1.hash = 1805184399 ) 
        AND EXISTS ( SELECT s2.post_id FROM wp_rama_search_index AS s2 
            WHERE s2.post_id = p.ID AND s2.hash = 1823159221 ) 
        AND EXISTS ( SELECT s3.post_id FROM wp_rama_search_index AS s3 
            WHERE s3.post_id = p.ID AND s3.hash = 1692658528 ) ) 
) OR ( 
    ( EXISTS ( SELECT s4.post_id FROM wp_rama_search_index AS s4 
            WHERE s4.post_id = p.ID AND s4.hash = 332583789 ) ) 
) OR ( 
    ( EXISTS ( SELECT s5.post_id FROM wp_rama_search_index AS s5 
            WHERE s5.post_id = p.ID AND s5.hash = 623525713 ) ) 
) OR ( 
    ( EXISTS ( SELECT s6.post_id FROM wp_rama_search_index AS s6 
            WHERE s6.post_id = p.ID AND s6.hash = -2064050708 ) 
        AND EXISTS ( SELECT s7.post_id FROM wp_rama_search_index AS s7 
            WHERE s7.post_id = p.ID AND s7.hash = 1692658528 ) ) 
) OR ( 
    ( EXISTS ( SELECT s8.post_id FROM wp_rama_search_index AS s8 
            WHERE s8.post_id = p.ID AND s8.hash = 263456517 ) 
        AND EXISTS ( SELECT s9.post_id FROM wp_rama_search_index AS s9 
            WHERE s9.post_id = p.ID AND s9.hash = -1214274178 ) ) 
) OR ( 
    ( EXISTS ( SELECT s10.post_id FROM wp_rama_search_index AS s10 
            WHERE s10.post_id = p.ID AND s10.hash = -2064050708 ) 
        AND EXISTS ( SELECT s11.post_id FROM wp_rama_search_index AS s11 
            WHERE s11.post_id = p.ID AND s11.hash = -1864773421 ) ) 
) OR ( 
    ( EXISTS ( SELECT s12.post_id FROM wp_rama_search_index AS s12 
            WHERE s12.post_id = p.ID AND s12.hash = -1227797236 ) ) 
) OR ( 
    ( EXISTS ( SELECT s13.post_id FROM wp_rama_search_index AS s13 
            WHERE s13.post_id = p.ID AND s13.hash = 1823159221 ) 
        AND EXISTS ( SELECT s14.post_id FROM wp_rama_search_index AS s14 
            WHERE s14.post_id = p.ID AND s14.hash = -1214274178 ) ) 
) OR ( 
    ( EXISTS ( SELECT s15.post_id FROM wp_rama_search_index AS s15 
            WHERE s15.post_id = p.ID AND s15.hash = 323592937 ) ) 
) OR ( 
    ( EXISTS ( SELECT s16.post_id FROM wp_rama_search_index AS s16 
            WHERE s16.post_id = p.ID AND s16.hash = 322413837 ) ) 
) OR ( 
    ( EXISTS ( SELECT s17.post_id FROM wp_rama_search_index AS s17 
            WHERE s17.post_id = p.ID AND s17.hash = 472301092 ) ) ) ) ) 
ORDER BY p.post_date DESC

This query runs in about 1.1s from phpMyAdmin connecting to a large AWS Aurora database instance.

There are 35k published posts. Words in post titles and excerpts of post content are inflected and hashed using FVN1A32.

The "AND" operations represent phrases where both hashes must be matched. There are several hashes because keyword stemming is being used and where aliases of keywords can also be phrases.

2

Answers


  1. I think this would do the trick for you. It is basically a question of Relational Division, with multiple possible divisors.

    I must say, I’m not entirely familiar with MySQL so I may have some slight syntax errors. But I’m sure you will get the idea.

    Put the different search conditions in a temporary table, each group having a group number.

    We select all rows in our main table, where our temp table has a group for which the total number of hashes in that group is the same as the number of matches on those hashes. In other words, every requested hash in the group matches.

    CREATE TABLE #temp (grp int not null, hash int not null, primary key (grp, hash));
    
    SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date 
    FROM wp_posts AS p
    WHERE p.post_status = 'publish' AND 
        EXISTS ( SELECT 1
            FROM #temp t
            LEFT JOIN wp_rama_search_index AS s0
                ON s0.hash = t.hash AND s0.post_id = p.ID
            GROUP BY grp
            HAVING COUNT(*) = COUNT(s0.hash)
        );
    

    There are other ways to slice this, for example the following, which may be more efficient:

    SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date 
    FROM wp_posts AS p
    CROSS JOIN #temp t
    LEFT JOIN wp_rama_search_index AS s0
        ON s0.hash = t.hash AND s0.post_id = p.ID
    WHERE p.post_status = 'publish'
    GROUP BY p.ID    -- functional dependency??
    HAVING COUNT(*) = COUNT(s0.hash);
    

    See also these excellent articles on SimpleTalk: Divided We Stand: The SQL of Relational Division and High Performance Relational Division in SQL Server. The basic principles should be the same in MySQL.

    I would put an compound index on the temp table if you can, not sure which order.

    Login or Signup to reply.
  2. SELECT s15.post_id FROM wp_rama_search_index AS s15 
            WHERE s15.post_id = p.ID AND s15.hash = 323592937
    

    What is the real name of s15? That may give us some clues of how to improve the query. Also, what is hash about?

    Change to simply SELECT 1 FROM ...; there is not an advantage (and maybe a disadvantage) in listing any columns when doing EXISTS).

    This composite index, with the columns in either order, should help:

    INDEX(post_id, hash)
    

    OR is deadly to optimization. However, in your query, it may be advantageous to sort the EXISTS(...) clauses with the most likely ones first.

    This is WordPress? It may be that collecting the attributes that you are searching among into a single column (in wp_rama_search_index?), applying a FULLTEXT index to the column, then using MATCH(ft_column) AGAINST ("foo bar baz ..."). This may do most of the ORs without needing hashes and lots of EXISTS. (Again, I am guessing what is going on with s15andhash`.)

    Your preprocessing of the words would still be useful for synonyms, but unnecessary for inflections (since FULLTEXT handles that mostly). Phrases can be handled with quotes. That may obviate your current use of AND.

    If page-load takes 1.5 seconds aside from the SELECT, I suspect there are other things that need optimizing.

    danblock’s suggestion about replacing OR with IN works like this:

        EXISTS ( SELECT s15.post_id FROM wp_rama_search_index AS s15 
            WHERE s15.post_id = p.ID AND s15.hash = 323592937 
               ) 
    OR 
        EXISTS ( SELECT s16.post_id FROM wp_rama_search_index AS s16 
            WHERE s16.post_id = p.ID AND s16.hash = 322413837 
               ) 
    

    –>

    OR EXISTS( SELECT 1 FROM wp_rama_search_index
            WHERE post_id = p.ID
              AND hash IN ( 323592937, 322413837 )
    

    (LIMIT 1 is not needed.) My suggestion of FULLTEXT supersedes this.

    Using FULLTEXT will probably eliminate the speed variation due to having (or not) the ORDER BY.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search