skip to Main Content

I have a query like this:

EXPLAIN SELECT b.* FROM bills b WHERE
(b.id IN (SELECT u.id FROM bills u WHERE u.updated BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40' AND u.organisation_id = 'ABC123')
OR b.id IN (SELECT c.id FROM bills c WHERE c.created BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40' AND c.organisation_id = 'ABC123'))

but the result is

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t0 NULL ALL NULL NULL NULL NULL 10402901 100 Using where
3 SUBQUERY c NULL range PRIMARY,organisation_created organisation_created 41 NULL 1 100 Using where; Using index
2 SUBQUERY u NULL range PRIMARY,updated,organisation_created updated 5 NULL 4 1.87 Using index condition; Using where

can anyone help me understand why the PRIMARY index (which is only on id) is not being used in the first row? I tried forcing that index to be used with FORCE INDEX (PRIMARY) but it still wasn’t used. Also, is it possible to re-write this as a single query that will run faster with JPA (so UNION is unfortunately not available)?

Edit:
It seems I should have mentioned that I’m currently re-writing this as a more complex version of

SELECT b.* FROM bills b WHERE
(b.updated BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40'
OR c.created BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40')
AND c.organisation_id = 'ABC123';

Second edit:
It seems I should have also mentioned that I have indexes on (updated) and (organisation_id,created). I thought these weren’t relevant because my sub queries are using them correctly, but in the previous version of the query, because of the OR condition on two inequalities (< / >) on different columns, MySQL was unable to make use of the two indexes. It only used one and then performed a either a table scan or a full lookup for all entries with the same organisation_id for the other.

4

Answers


  1. Unfortunately, query expressions with OR are hard to optimize.

    Think of a telephone book. It’s like an index defined on (last_name, first_name) in that order.

    If I search for last_name='Karwin' the index benefits the search.

    If I search only for first_name='Bill' the index does not help, because the index is not sorted by first name. Many entries have the first name ‘Bill’ and they can be anywhere. The index can never be used, even if I use FORCE INDEX.

    If I had a second telephone book organized by (first_name, last_name) then my search for first_name='Bill' could perform the search on the first column of that index efficiently. But a search only on last name could not use this index.

    If I want to search for first_name='Bill' AND last_name='Karwin' then either index would help. It would search the first column efficiently, then among the matches, it would search the second column, also efficiently.

    But if I want to search for first_name='Bill' OR last_name='Karwin' then neither index can do both. And if we assume that only one index can be used at a time, then we’re stuck. It’s bound to require reading the telephone book cover-to-cover either way.

    This is why FORCE INDEX is not solving this problem. It does no good to insist, when the index simply can’t do the job. That’s not what FORCE INDEX means anyway. It only means that the optimizer should consider a table-scan as a more costly choice than the named index, so it would prefer to use the index if possible.

    The standard solution is to break up the query that uses an OR expression into two queries, then each query can use a different index, and with UNION the results of each query are recombined.

    But standard JPA doesn’t support UNION*
    So you must use @Query( ..., nativeQuery = true ) if you need UNION.


    * I have read that EclipseLink JPA does support UNION, cf. https://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL#UNION But the documentation is pretty thin, and apparently their wiki is going to be shut down. I’d be reluctant to use EclipseLink, it seems half-baked.

    Login or Signup to reply.
  2. Query can be simplified because id is a PK and you selecting ids from the same table, you do not need subqueries here.

    SELECT b.* FROM bills b 
     WHERE
    
     (b.updated BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40'
     OR 
     b.created BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40'
     ) AND b.organisation_id = 'ABC123'
    

    Try indexing by created, updated and organisation_id columns.

    Definitely index by organisation_id will be the most efficient

    Login or Signup to reply.
  3. Your query is overly complicated. Why select the IDs from the table rows and then use these IDs to access the same table rows? This doesn’t seem to make sense.

    Instead use OR:

    WHERE organisation_id = 'ABC123'
    AND ( (created ...) OR (updated ...) )
    

    The full query can look like this:

    SELECT * 
    FROM bills
    WHERE organisation_id = 'ABC123'
    AND
    (
      (
        updated >= TIMESTAMP '2023-12-19 14:17:48' 
        AND 
        updated <  TIMESTAMP '2023-12-19 14:27:41'
      )
     OR
      (
        created >= TIMESTAMP '2023-12-19 14:17:48' 
        AND
        created <  TIMESTAMP '2023-12-19 14:27:41'
      )
    );
    

    The problem with this: Do we want an index on (organisation_id, created) or on (organisation_id, updated)? Or have both timestamps in the index, but if so in which order?

    Ideally we would create the first two mentioned indexes and the DBMS would use both.

    CREATE INDEX idx1 ON bills (organisation_id, created);
    CREATE INDEX idx2 ON bills (organisation_id, updated);
    

    But often the DBMS optimzer doesn’t see this simple solution. We can help it see the task in a different light:

    SELECT * 
    FROM bills
    WHERE organisation_id = 'ABC123'
    AND updated >= TIMESTAMP '2023-12-19 14:17:48' 
    AND updated <  TIMESTAMP '2023-12-19 14:27:41'
    UNION 
    SELECT * 
    FROM bills
    WHERE organisation_id = 'ABC123'
    AND created >= TIMESTAMP '2023-12-19 14:17:48' 
    AND created <  TIMESTAMP '2023-12-19 14:27:41';
    
    Login or Signup to reply.
  4. Why not using PK?

    The only way MySQL knows to perform

    SELECT ...
        WHERE (...) OR (...)
    

    is to do a ‘full table scan’. Your attempt to get the id values and then load * was thwarted by OR. So, let’s work on avoiding that formulation.

    Without UNION — As leftjoin says,

    SELECT  *
        FROM  bills
        WHERE organisation_id = 'ABC123')
          AND (   created BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40'
               OR updated BETWEEN '2023-12-19 14:17:48' AND '2023-12-19 14:27:40'
              )
    

    And have

    INDEX(organisation_id)
    

    That formulation will use that one query reasonably efficiently. Having two indexes (and changing the query formulation) is unlikely to help performance.

    This may help a little:

    INDEX(organisation_id, created, updated)
    

    With UNION — Depending of the distribution of the data, Thorsten’s UNION DISTINCT and two indexes may be faster, but I doubt it. Note that UNION (aka UNION DISTINCT) must store intermediate results to a file, then de-duplicate.

    That is, the query benefit from the two indexes can be used, but the extra overhead of UNION DISTINCT may wipe out the benefits.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search