I’m working on a project with a MySQL database. I have some CTEs that summarize features of the users (each CTE summarizes one feature). Then I have a last query that joins all other CTEs on a column named user_id.
The following example describes this well (note that the actual problem is more complex and this is just a way to explain it in a clear way):
CREATE TABLE users (
user_id int(11) NOT NULL AUTO_INCREMENT,
name varchar(255),
PRIMARY KEY (user_id)
);
CREATE TABLE scores (
user_id int(11) NOT NULL auto_increment,
score int(3) default NULL,
CONSTRAINT `users_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`user_id`)
);
with
avg_score_by_user as (
select
user_id,
avg(score) as avg_score
from
scores
group by user_id
),
max_score_by_user as (
select
user_id,
max(score) as max_score
from
scores
group by user_id
)
select
users.user_id,
avg_score_by_user.avg_score,
max_score_by_user.max_score
from
users
join
avg_score_by_user
on
avg_score_by_user.user_id = users.user_id
join
max_score_by_user
on
max_score_by_user.user_id = users.user_id;
In my use case, I use this to query the features of a specific user. To do that I do where user_id=1
(or some other id) at the end of the query.
The problem is that this outer where (user_id=1) does not propagate to all joined CTEs. I haven’t found a proof in MySQL’s docs but according to some sources (https://www.oreilly.com/library/view/high-performance-mysql/9780596101718/ch04.html under "Equality propagation") and some Googling, the where user_id=1
condition should propagate to max_score_by_user
and avg_score_by_user
so that the filtering happens early on those CTEs instead of happening on the whole result set.
To clarify this problem, we can look at the explain
of the query:
You can see that both CTEs (the DERIVED tables) scan all 6 rows where they should scan only 3 each as these are the rows corresponding to user_id=1
.
Interestingly, if we filter using where avg_score_by_user.user_id=1
, the optimization does happen and avg_score_by_user
scans only 3 rows but max _score_by_user
still scans 6 rows:
Explain of the CTE query with different where condition
Bottom line, I would want the where users.user_id=1
to happen early in each CTE so that each CTE scans only the relevant rows for that user.
I tried using the using(user_id)
instead of the on
clause but still did not work.
3
Answers
You can add the
WHERE
condition to the CTEs. But, in your case, the CTEs are not needed at all. You will get the same result with:If your query is more complex, and shown query is a pattern, then you may do the next trick:
Alternatively you may assign your parameter value into the user-defined variable in the most first CTE then use it in WHERE instead of CTE joining.
Start with a "derived table", then
JOIN
tousers
to get other data.