Found Problem:
I was able to significantly reduce the response time to less than 0.01 seconds from 1.1 seconds simply by removing the ORDER BY clause.
I was doing a USORT in PHP on dates anyway since the post-processing would reorder the posts by type of match, so the ORDER BY clause was unnecessary.
The query described by Larry Lustig is very fast for integers. It was the ordering by a string dates which caused significant performance problems.
Hope this servers as a good example of how to apply conditions to multiple rows and why you want to watch out for database operations involving strings.
Original Question (Edited for Clarity):
I’m working on an indexed search with keyword stemming. I’m looking to optimize it more.
Larry Lustig’s answer helped me a lot here on how to create the query for easy scaling:
SQL for applying conditions to multiple rows in a join
Any suggestions of how to optimize this example query to run faster? I am guessing there could be a way to do one join for all the "s{x).post_id = p.ID" conditions.
SELECT p.ID, p.post_title, LEFT ( p.post_content, 800 ), p.post_date
FROM wp_posts AS p
WHERE p.post_status = 'publish' AND ( ( (
( EXISTS ( SELECT s0.post_id FROM wp_rama_search_index AS s0
WHERE s0.post_id = p.ID AND s0.hash = -617801035 ) )
) OR (
( EXISTS ( SELECT s1.post_id FROM wp_rama_search_index AS s1
WHERE s1.post_id = p.ID AND s1.hash = 1805184399 )
AND EXISTS ( SELECT s2.post_id FROM wp_rama_search_index AS s2
WHERE s2.post_id = p.ID AND s2.hash = 1823159221 )
AND EXISTS ( SELECT s3.post_id FROM wp_rama_search_index AS s3
WHERE s3.post_id = p.ID AND s3.hash = 1692658528 ) )
) OR (
( EXISTS ( SELECT s4.post_id FROM wp_rama_search_index AS s4
WHERE s4.post_id = p.ID AND s4.hash = 332583789 ) )
) OR (
( EXISTS ( SELECT s5.post_id FROM wp_rama_search_index AS s5
WHERE s5.post_id = p.ID AND s5.hash = 623525713 ) )
) OR (
( EXISTS ( SELECT s6.post_id FROM wp_rama_search_index AS s6
WHERE s6.post_id = p.ID AND s6.hash = -2064050708 )
AND EXISTS ( SELECT s7.post_id FROM wp_rama_search_index AS s7
WHERE s7.post_id = p.ID AND s7.hash = 1692658528 ) )
) OR (
( EXISTS ( SELECT s8.post_id FROM wp_rama_search_index AS s8
WHERE s8.post_id = p.ID AND s8.hash = 263456517 )
AND EXISTS ( SELECT s9.post_id FROM wp_rama_search_index AS s9
WHERE s9.post_id = p.ID AND s9.hash = -1214274178 ) )
) OR (
( EXISTS ( SELECT s10.post_id FROM wp_rama_search_index AS s10
WHERE s10.post_id = p.ID AND s10.hash = -2064050708 )
AND EXISTS ( SELECT s11.post_id FROM wp_rama_search_index AS s11
WHERE s11.post_id = p.ID AND s11.hash = -1864773421 ) )
) OR (
( EXISTS ( SELECT s12.post_id FROM wp_rama_search_index AS s12
WHERE s12.post_id = p.ID AND s12.hash = -1227797236 ) )
) OR (
( EXISTS ( SELECT s13.post_id FROM wp_rama_search_index AS s13
WHERE s13.post_id = p.ID AND s13.hash = 1823159221 )
AND EXISTS ( SELECT s14.post_id FROM wp_rama_search_index AS s14
WHERE s14.post_id = p.ID AND s14.hash = -1214274178 ) )
) OR (
( EXISTS ( SELECT s15.post_id FROM wp_rama_search_index AS s15
WHERE s15.post_id = p.ID AND s15.hash = 323592937 ) )
) OR (
( EXISTS ( SELECT s16.post_id FROM wp_rama_search_index AS s16
WHERE s16.post_id = p.ID AND s16.hash = 322413837 ) )
) OR (
( EXISTS ( SELECT s17.post_id FROM wp_rama_search_index AS s17
WHERE s17.post_id = p.ID AND s17.hash = 472301092 ) ) ) ) )
ORDER BY p.post_date DESC
This query runs in about 1.1s from phpMyAdmin connecting to a large AWS Aurora database instance.
There are 35k published posts. Words in post titles and excerpts of post content are inflected and hashed using FVN1A32.
The "AND" operations represent phrases where both hashes must be matched. There are several hashes because keyword stemming is being used and where aliases of keywords can also be phrases.
2
Answers
I think this would do the trick for you. It is basically a question of Relational Division, with multiple possible divisors.
I must say, I’m not entirely familiar with MySQL so I may have some slight syntax errors. But I’m sure you will get the idea.
Put the different search conditions in a temporary table, each group having a group number.
We select all rows in our main table, where our temp table has a group for which the total number of hashes in that group is the same as the number of matches on those hashes. In other words, every requested hash in the group matches.
There are other ways to slice this, for example the following, which may be more efficient:
See also these excellent articles on SimpleTalk: Divided We Stand: The SQL of Relational Division and High Performance Relational Division in SQL Server. The basic principles should be the same in MySQL.
I would put an compound index on the temp table if you can, not sure which order.
What is the real name of
s15
? That may give us some clues of how to improve the query. Also, what ishash
about?Change to simply
SELECT 1 FROM ...
; there is not an advantage (and maybe a disadvantage) in listing any columns when doingEXISTS
).This composite index, with the columns in either order, should help:
OR
is deadly to optimization. However, in your query, it may be advantageous to sort theEXISTS(...)
clauses with the most likely ones first.This is WordPress? It may be that collecting the attributes that you are searching among into a single column (in
wp_rama_search_index
?), applying aFULLTEXT
index to the column, then usingMATCH(ft_column) AGAINST ("foo bar baz ..."). This may do most of the ORs without needing hashes and lots of
EXISTS. (Again, I am guessing what is going on with
s15and
hash`.)Your preprocessing of the words would still be useful for synonyms, but unnecessary for inflections (since FULLTEXT handles that mostly). Phrases can be handled with quotes. That may obviate your current use of
AND
.If page-load takes 1.5 seconds aside from the
SELECT
, I suspect there are other things that need optimizing.danblock’s suggestion about replacing
OR
withIN
works like this:–>
(
LIMIT 1
is not needed.) My suggestion ofFULLTEXT
supersedes this.Using FULLTEXT will probably eliminate the speed variation due to having (or not) the
ORDER BY
.