skip to Main Content

I’m working on a project with a MySQL database. I have some CTEs that summarize features of the users (each CTE summarizes one feature). Then I have a last query that joins all other CTEs on a column named user_id.

The following example describes this well (note that the actual problem is more complex and this is just a way to explain it in a clear way):

CREATE TABLE users (
    user_id int(11) NOT NULL AUTO_INCREMENT,
    name varchar(255),
    PRIMARY KEY (user_id)
);

CREATE  TABLE scores (
   user_id int(11) NOT NULL auto_increment,
   score int(3) default NULL,
   CONSTRAINT `users_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`user_id`)
);

users table

scores table

with
    avg_score_by_user as (
        select 
            user_id,
            avg(score) as avg_score
        from
            scores
        group by user_id
    ),
    
    max_score_by_user as (
        select
            user_id,
            max(score) as max_score
        from
            scores
        group by user_id
    )
    
    select
        users.user_id,
        avg_score_by_user.avg_score,
        max_score_by_user.max_score
    from
        users
    join
        avg_score_by_user
    on
        avg_score_by_user.user_id = users.user_id
    join
        max_score_by_user
    on
        max_score_by_user.user_id = users.user_id;

In my use case, I use this to query the features of a specific user. To do that I do where user_id=1 (or some other id) at the end of the query.

The problem is that this outer where (user_id=1) does not propagate to all joined CTEs. I haven’t found a proof in MySQL’s docs but according to some sources (https://www.oreilly.com/library/view/high-performance-mysql/9780596101718/ch04.html under "Equality propagation") and some Googling, the where user_id=1 condition should propagate to max_score_by_user and avg_score_by_user so that the filtering happens early on those CTEs instead of happening on the whole result set.

To clarify this problem, we can look at the explain of the query:

Explain of the CTE query

You can see that both CTEs (the DERIVED tables) scan all 6 rows where they should scan only 3 each as these are the rows corresponding to user_id=1.

Interestingly, if we filter using where avg_score_by_user.user_id=1, the optimization does happen and avg_score_by_user scans only 3 rows but max _score_by_user still scans 6 rows:

Explain of the CTE query with different where condition

Bottom line, I would want the where users.user_id=1 to happen early in each CTE so that each CTE scans only the relevant rows for that user.

I tried using the using(user_id) instead of the on clause but still did not work.

3

Answers


  1. You can add the WHERE condition to the CTEs. But, in your case, the CTEs are not needed at all. You will get the same result with:

    select
        s.user_id,
        avg(s.score) as avg_score,
        max(s.score) as max_score,
    from scores s
    group by s.user_id
    
    Login or Signup to reply.
  2. In my use case, I use this to query the features of a specific user. To do that I do where user_id=1 (or some other id) at the end of the query.

    SELECT user_id,
           AVG(scores.score) AS avg_score,
           MAX(scores.score) AS max_score
    FROM users
    JOIN scores USING (user_id)
    WHERE user_id = 1
    -- or
    -- WHERE user_id IN ( {user_id list} )
    -- GROUP BY user_id
    ;
    

    If your query is more complex, and shown query is a pattern, then you may do the next trick:

    WITH 
    parameter AS (
        SELECT {needed user_id} AS user_id
        ),
    cte1 AS (
        SELECT ...
        FROM ...
        JOIN parameter USING (user_id)
        ),
    cte2 AS (
        SELECT ...
        FROM ...
        JOIN parameter USING (user_id)
        ),
    ...
    
    SELECT ...
    FROM ...
    JOIN cte1 USING (user_id)
    JOIN cte2 USING (user_id)
    ...
    

    Alternatively you may assign your parameter value into the user-defined variable in the most first CTE then use it in WHERE instead of CTE joining.

    WITH 
    cte1 AS (
        SELECT ...
        FROM ...
        WHERE user_id = ( @user_id := {needed user_id} )
        ),
    cte2 AS (
        SELECT ...
        FROM ...
        WHERE user_id = @user_id
        ),
    ...
    
    SELECT ...
    FROM ...
    JOIN cte1 USING (user_id)
    JOIN cte2 USING (user_id)
    ...
    
    Login or Signup to reply.
  3. Start with a "derived table", then JOIN to users to get other data.

    SELECT  users.user_id,
            users.name
        FROM (
           select
               s.user_id,
               avg(s.score) as avg_score,
               max(s.score) as max_score,
           from scores s
           group by s.user_id
              ) AS agg
        JOIN users  ON agg.user_id = users.user_id;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search