I’ve been learning about optimizer hints in MySQL (my version is 8.0.36), and I’ve been able to experiment with them in very simplified scenarios and it works. However, as soon as I add a nested level, it all breaks down.
For example, assume I have 4 tables:
activities
:id
,rubric_id
,group_id
rubric_templates
:id
,title
rubric_sessions
:id
,template_id
,name
rubric_session_elements
:id
,rubric_session_id
,value
All ids have foreign keys established. There are many rubric session elements for a given rubric session, and many rubric sessions for a given rubric template. Many activities can use the same rubric, and there are many activities in a group. So no unique keys anywhere. Everything is a many-to-one.
If I instruct the optimizer to use DUPSWEEDOUT
or MATERIALIZATION
where there is only two tables, it works:
-- Get all rubric sessions for a group
EXPLAIN FORMAT=TREE
SELECT /*+ SEMIJOIN(@subq1 MATERIALIZATION) */ rs.*
FROM rubric_sessions rs
WHERE rs.template_id IN (
SELECT /*+ QB_NAME(subq1) */ a.rubric_template_id
FROM activities a
WHERE group_id = 123
);
/*
-> Nested loop inner join (cost=24.6 rows=48)
-> Filter: (`<subquery2>`.rubric_template_id is not null) (cost=0.00417..0.00417 rows=1)
-> Table scan on <subquery2> (cost=3.46..3.46 rows=1)
-> Materialize with deduplication (cost=0.951..0.951 rows=1)
-> Filter: (a.rubric_template_id is not null) (cost=0.851 rows=1)
-> Index lookup on a using group_id (group_id=123) (cost=0.851 rows=1)
-> Index lookup on rs using template_id (template_id=`<subquery2>`.rubric_template_id), with index condition: (rs.template_id = `<subquery2>`.rubric_template_id) (cost=24.6 rows=48)
*/
-- Replace MATERIALIZATION with DUPSWEEDOUT changes the plan to this:
/*
-> Remove duplicate rs rows using temporary table (weedout) (cost=25.4 rows=48)
-> Nested loop inner join (cost=25.4 rows=48)
-> Filter: (a.rubric_template_id is not null) (cost=0.851 rows=1)
-> Index lookup on a using group_id (group_id=123) (cost=0.851 rows=1)
-> Index lookup on rs using template_id (template_id=a.rubric_template_id), with index condition: (rs.template_id = a.rubric_template_id) (cost=24.5 rows=48)
*/
However if I want to take it a step further, it will only ever do weedout. It refuses to do materialization.
-- Get all rubric session elements for a group
EXPLAIN FORMAT=TREE
SELECT /*+ SEMIJOIN(@subq1 MATERIALIZATION) */ rse.*
FROM rubric_session_elements rse
WHERE rse.rubric_session_id IN (
SELECT rs.id
FROM rubric_sessions rs
WHERE rs.template_id IN (
SELECT /*+ QB_NAME(subq1) */ a.rubric_template_id
FROM activities a
WHERE group_id = 123
)
);
/*
-> Remove duplicate (rs, rse) rows using temporary table (weedout) (cost=527 rows=572)
-> Nested loop inner join (cost=527 rows=572)
-> Nested loop inner join (cost=5.92 rows=48)
-> Filter: (a.rubric_template_id is not null) (cost=0.851 rows=1)
-> Index lookup on a using group_id (group_id=123) (cost=0.851 rows=1)
-> Filter: (rs.template_id = a.rubric_template_id) (cost=5.07 rows=48)
-> Covering index lookup on rs using template_id (template_id=a.rubric_template_id) (cost=5.07 rows=48)
-> Index lookup on rse using rubric_session_elements_rubric_session_id_foreign (rubric_session_id=rs.id) (cost=9.68 rows=11.9)
*/
No matter how I change it, I cannot convince the optimizer to use MATERIALIZATION. I’ve put the semijoin hint on both outer and middle selects. I’ve put QB_NAME on both middle and inner selects. I even tried:
SET SESSION optimizer_switch = 'duplicateweedout=off';
but it still comes out as "weedout". If I specify NO_SEMIJOIN(DUPSWEEDOUT)
then it seems to revert to hash join, but that’s not MATERIALIZATION. It doesn’t look like I violated in the rules specified in https://dev.mysql.com/doc/refman/8.4/en/subquery-materialization.html, so what am I doing wrong?
2
Answers
I found that if I put the subquery into a view with ALGORITHM=TEMPTABLE, then it would materialize like I expect. That got me thinking that maybe I can achieve the same using Common Table Expressions. Which lead me to the NO_MERGE optimization:
By extracting the subquery into a CTE, then adding the NO_MERGE() hint, I was able to get the optimizer to materialize the subquery, but not do a full table scan on rse:
Which in my scenario, where there is a lot of duplicate rubric template ids, greatly sped up the results.
MYSQL decides the relevant strategy: MATERIALIZATION (or) DUPSWEEDOUT based on the cost even if we force it (according to MySQL manuals for SEMIJOIN Hint). On the other hand, there is another Hint SUBQUERY, which you can use since nested subqueries are present in your query, which uses the materialization strategy.
The modified code below will use the MATERIALIZATION strategy:
Please let me know if you need more information.