Okay, I’m absolutely flummoxed trying to understand what’s going on here.
Everywhere I’m reading it is suggested to use MAX()/MIN() on a data column with a GROUP BY in order to get the highest/lowest value for each group of data. Yet every attempt I try to retrieve the data in such a way, it returns the max/min of one column, but then displays that data with the ID from a different row?!
I’ve distilled my query below to the simplest level, just the subscript that I’m trying to JOIN later with another larger query:
SELECT studentID, studentCourseID, MAX(dateCreated) AS dateCreated
FROM courseAttempts
JOIN courses ON courseID = studentCourseID AND courseActive = 1
WHERE studentID IN (100,101,102,103) AND studentBlocked = 0
GROUP BY studentCourseID
Below is the data that the script is able to retrieve without the use of MAX() or GROUP BY
studentID | courseID | dateCreated |
---|---|---|
100 | 39 | 2023-09-07 02:48:46.000 |
102 | 39 | 2023-12-24 02:57:07.000 |
If I run the script above with the MAX() and GROUP BY I get:
studentID | courseID | dateCreated |
---|---|---|
100 | 39 | 2023-12-24 02:57:07.000 |
That’s the right date, but the wrong ID. That’s malformed unusable data.
How is this supposed to work?
It should return
studentID | courseID | dateCreated |
---|---|---|
102 | 39 | 2023-12-24 02:57:07.000 |
2
Answers
Use a windowing function:
In your query,
studentID
in the SELECT list is neither an aggregated column nor a column stated in the GROUP BY clause. This is NOT allowed in the earlier SQL standards (SQL-92 and before) . However, SQL-1999 and later permits such nonaggregated column , ON ONE CONDITION, which is the non aggregated column MUST be functionally dependent on the columns in the GROUP BY clause. See: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.htmlFor example, in an e-commerce platform, a user_id is unique. For every user_id, there is a unique user_name associated to it.(Note: the relationship should be defined in table structure, such as using PK/UNIQUE key) In this case, the user_name is functionally dependent on the user_id. Under such circumstance, if you want to get the total spending by each user_id, it’s legit to put user_name in the SELECT list without defining it in the GROUP BY clause:
But in your case, from the data without the use of MAX() or GROUP BY,
we can already see it is possible for a
studentCourseID
to have more than onestudentID
. Therefore,studentID
is not functionally dependent on studentCourseID.(The reason that your query does not raise an error is becauseONLY_FULL_GROUP_BY
is not defined in the system variablesql_mode
. ) So you should either put it in the GROUP BY list, or omit it from the SELECT list. Otherwise, how can MySQL know which studentID to return ? (Especially in the case of sum() ,when every studentID from a studentCourseID contributes to the aggregate value)Your might think that MySQL should just automatically return the studentID which has the latest date for a courseID. Unfortunately, it does not work that way. We have to do some extra work. Here are the two steps to do the job:
First of all, we need to get the max date for each studentCourseID:
Then, we perform a JOIN on our original table and the aggregate table from previous step: