skip to Main Content

Okay, I’m absolutely flummoxed trying to understand what’s going on here.

Everywhere I’m reading it is suggested to use MAX()/MIN() on a data column with a GROUP BY in order to get the highest/lowest value for each group of data. Yet every attempt I try to retrieve the data in such a way, it returns the max/min of one column, but then displays that data with the ID from a different row?!

I’ve distilled my query below to the simplest level, just the subscript that I’m trying to JOIN later with another larger query:

SELECT studentID, studentCourseID, MAX(dateCreated) AS dateCreated
FROM courseAttempts
JOIN courses ON courseID = studentCourseID AND courseActive = 1 
WHERE studentID IN (100,101,102,103) AND studentBlocked = 0
GROUP BY studentCourseID

Below is the data that the script is able to retrieve without the use of MAX() or GROUP BY

studentID courseID dateCreated
100 39 2023-09-07 02:48:46.000
102 39 2023-12-24 02:57:07.000

If I run the script above with the MAX() and GROUP BY I get:

studentID courseID dateCreated
100 39 2023-12-24 02:57:07.000

That’s the right date, but the wrong ID. That’s malformed unusable data.
How is this supposed to work?

It should return

studentID courseID dateCreated
102 39 2023-12-24 02:57:07.000

2

Answers


  1. Use a windowing function:

    WITH cte AS (
      SELECT
        studentID,
        studentCourseID,
        dateCreated, 
        ROW_NUMBER() OVER (PARTITION BY studentCourseID ORDER BY dateCreated DESC) AS row_num
      FROM courseAttempts
      JOIN courses ON courseID = studentCourseID AND courseActive = 1 
      WHERE studentID IN (100,101,102,103) AND studentBlocked = 0
    )
    SELECT * FROM cte WHERE row_num = 1
    
    Login or Signup to reply.
  2. In your query, studentID in the SELECT list is neither an aggregated column nor a column stated in the GROUP BY clause. This is NOT allowed in the earlier SQL standards (SQL-92 and before) . However, SQL-1999 and later permits such nonaggregated column , ON ONE CONDITION, which is the non aggregated column MUST be functionally dependent on the columns in the GROUP BY clause. See: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html

    For example, in an e-commerce platform, a user_id is unique. For every user_id, there is a unique user_name associated to it.(Note: the relationship should be defined in table structure, such as using PK/UNIQUE key) In this case, the user_name is functionally dependent on the user_id. Under such circumstance, if you want to get the total spending by each user_id, it’s legit to put user_name in the SELECT list without defining it in the GROUP BY clause:

    select user_name , user_id , sum(spending)
    ...
    group by user_id
    ;
    

    But in your case, from the data without the use of MAX() or GROUP BY,

    studentID   courseID    dateCreated
    100         39          2023-09-07 02:48:46.000
    102         39          2023-12-24 02:57:07.000
    

    we can already see it is possible for a studentCourseID to have more than one studentID. Therefore, studentID is not functionally dependent on studentCourseID.(The reason that your query does not raise an error is because ONLY_FULL_GROUP_BY is not defined in the system variable sql_mode. ) So you should either put it in the GROUP BY list, or omit it from the SELECT list. Otherwise, how can MySQL know which studentID to return ? (Especially in the case of sum() ,when every studentID from a studentCourseID contributes to the aggregate value)

    Your might think that MySQL should just automatically return the studentID which has the latest date for a courseID. Unfortunately, it does not work that way. We have to do some extra work. Here are the two steps to do the job:

    First of all, we need to get the max date for each studentCourseID:

    select studentCourseID, max(dateCreated) as maxdate
    from derived_table
    ;
    

    Then, we perform a JOIN on our original table and the aggregate table from previous step:

    select t1.studentID, t1.studentCourseID, t1.dateCreated
    from derived_table t1
    join (select studentCourseID, max(dateCreated) as maxdate
         from derived_table) t2
    on t1.studentCourseID=t2.studentCourseID and t1.dateCreated=t2.maxdate
    ;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search