I have a query like this:
EXPLAIN SELECT b.* FROM bills b WHERE
(b.id IN (SELECT u.id FROM bills u WHERE u.updated BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40' AND u.organisation_id = 'ABC123')
OR b.id IN (SELECT c.id FROM bills c WHERE c.created BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40' AND c.organisation_id = 'ABC123'))
but the result is
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | t0 | NULL | ALL | NULL | NULL | NULL | NULL | 10402901 | 100 | Using where |
3 | SUBQUERY | c | NULL | range | PRIMARY,organisation_created | organisation_created | 41 | NULL | 1 | 100 | Using where; Using index |
2 | SUBQUERY | u | NULL | range | PRIMARY,updated,organisation_created | updated | 5 | NULL | 4 | 1.87 | Using index condition; Using where |
can anyone help me understand why the PRIMARY
index (which is only on id
) is not being used in the first row? I tried forcing that index to be used with FORCE INDEX (PRIMARY)
but it still wasn’t used. Also, is it possible to re-write this as a single query that will run faster with JPA (so UNION
is unfortunately not available)?
Edit:
It seems I should have mentioned that I’m currently re-writing this as a more complex version of
SELECT b.* FROM bills b WHERE
(b.updated BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40'
OR c.created BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40')
AND c.organisation_id = 'ABC123';
Second edit:
It seems I should have also mentioned that I have indexes on (updated)
and (organisation_id,created)
. I thought these weren’t relevant because my sub queries are using them correctly, but in the previous version of the query, because of the OR
condition on two inequalities (<
/ >
) on different columns, MySQL was unable to make use of the two indexes. It only used one and then performed a either a table scan or a full lookup for all entries with the same organisation_id
for the other.
4
Answers
Unfortunately, query expressions with
OR
are hard to optimize.Think of a telephone book. It’s like an index defined on
(last_name, first_name)
in that order.If I search for
last_name='Karwin'
the index benefits the search.If I search only for
first_name='Bill'
the index does not help, because the index is not sorted by first name. Many entries have the first name ‘Bill’ and they can be anywhere. The index can never be used, even if I use FORCE INDEX.If I had a second telephone book organized by
(first_name, last_name)
then my search forfirst_name='Bill'
could perform the search on the first column of that index efficiently. But a search only on last name could not use this index.If I want to search for
first_name='Bill' AND last_name='Karwin'
then either index would help. It would search the first column efficiently, then among the matches, it would search the second column, also efficiently.But if I want to search for
first_name='Bill' OR last_name='Karwin'
then neither index can do both. And if we assume that only one index can be used at a time, then we’re stuck. It’s bound to require reading the telephone book cover-to-cover either way.This is why FORCE INDEX is not solving this problem. It does no good to insist, when the index simply can’t do the job. That’s not what FORCE INDEX means anyway. It only means that the optimizer should consider a table-scan as a more costly choice than the named index, so it would prefer to use the index if possible.
The standard solution is to break up the query that uses an
OR
expression into two queries, then each query can use a different index, and withUNION
the results of each query are recombined.But standard JPA doesn’t support
UNION
*So you must use
@Query( ..., nativeQuery = true )
if you needUNION
.* I have read that EclipseLink JPA does support
UNION
, cf. https://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL#UNION But the documentation is pretty thin, and apparently their wiki is going to be shut down. I’d be reluctant to use EclipseLink, it seems half-baked.Query can be simplified because id is a PK and you selecting ids from the same table, you do not need subqueries here.
Try indexing by created, updated and organisation_id columns.
Definitely index by organisation_id will be the most efficient
Your query is overly complicated. Why select the IDs from the table rows and then use these IDs to access the same table rows? This doesn’t seem to make sense.
Instead use
OR
:The full query can look like this:
The problem with this: Do we want an index on (organisation_id, created) or on (organisation_id, updated)? Or have both timestamps in the index, but if so in which order?
Ideally we would create the first two mentioned indexes and the DBMS would use both.
But often the DBMS optimzer doesn’t see this simple solution. We can help it see the task in a different light:
Why not using PK?
The only way MySQL knows to perform
is to do a ‘full table scan’. Your attempt to get the
id
values and then load*
was thwarted byOR
. So, let’s work on avoiding that formulation.Without UNION — As leftjoin says,
And have
That formulation will use that one query reasonably efficiently. Having two indexes (and changing the query formulation) is unlikely to help performance.
This may help a little:
With UNION — Depending of the distribution of the data, Thorsten’s
UNION DISTINCT
and two indexes may be faster, but I doubt it. Note thatUNION
(akaUNION DISTINCT
) must store intermediate results to a file, then de-duplicate.That is, the query benefit from the two indexes can be used, but the extra overhead of
UNION DISTINCT
may wipe out the benefits.