MySQL Sorting Using MAX() For GROUP BY Is Returning Malformed Data

BenjaminWKI
May 20, 2024
132 views
0 votes
2 Answers

Okay, I’m absolutely flummoxed trying to understand what’s going on here.

Everywhere I’m reading it is suggested to use MAX()/MIN() on a data column with a GROUP BY in order to get the highest/lowest value for each group of data. Yet every attempt I try to retrieve the data in such a way, it returns the max/min of one column, but then displays that data with the ID from a different row?!

I’ve distilled my query below to the simplest level, just the subscript that I’m trying to JOIN later with another larger query:

SELECT studentID, studentCourseID, MAX(dateCreated) AS dateCreated
FROM courseAttempts
JOIN courses ON courseID = studentCourseID AND courseActive = 1 
WHERE studentID IN (100,101,102,103) AND studentBlocked = 0
GROUP BY studentCourseID

Below is the data that the script is able to retrieve without the use of MAX() or GROUP BY

studentID	courseID	dateCreated
100	39	2023-09-07 02:48:46.000
102	39	2023-12-24 02:57:07.000

If I run the script above with the MAX() and GROUP BY I get:

studentID	courseID	dateCreated
100	39	2023-12-24 02:57:07.000

That’s the right date, but the wrong ID. That’s malformed unusable data.
How is this supposed to work?

It should return

studentID	courseID	dateCreated
102	39	2023-12-24 02:57:07.000

Tags: database mysql

Answers

Use a windowing function:

WITH cte AS (
  SELECT
    studentID,
    studentCourseID,
    dateCreated, 
    ROW_NUMBER() OVER (PARTITION BY studentCourseID ORDER BY dateCreated DESC) AS row_num
  FROM courseAttempts
  JOIN courses ON courseID = studentCourseID AND courseActive = 1 
  WHERE studentID IN (100,101,102,103) AND studentBlocked = 0
)
SELECT * FROM cte WHERE row_num = 1

- blabla_bingo
- May 20, 2024 at 6:19 am
- 0 votes
0
In your query, studentID in the SELECT list is neither an aggregated column nor a column stated in the GROUP BY clause. This is NOT allowed in the earlier SQL standards (SQL-92 and before) . However, SQL-1999 and later permits such nonaggregated column , ON ONE CONDITION, which is the non aggregated column MUST be functionally dependent on the columns in the GROUP BY clause. See: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html

For example, in an e-commerce platform, a user_id is unique. For every user_id, there is a unique user_name associated to it.(Note: the relationship should be defined in table structure, such as using PK/UNIQUE key) In this case, the user_name is functionally dependent on the user_id. Under such circumstance, if you want to get the total spending by each user_id, it’s legit to put user_name in the SELECT list without defining it in the GROUP BY clause:
```
select user_name , user_id , sum(spending)
...
group by user_id
;
```
But in your case, from the data without the use of MAX() or GROUP BY,
```
studentID   courseID    dateCreated
100         39          2023-09-07 02:48:46.000
102         39          2023-12-24 02:57:07.000
```
we can already see it is possible for a studentCourseID to have more than one studentID. Therefore, studentID is not functionally dependent on studentCourseID.(The reason that your query does not raise an error is because ONLY_FULL_GROUP_BY is not defined in the system variable sql_mode. ) So you should either put it in the GROUP BY list, or omit it from the SELECT list. Otherwise, how can MySQL know which studentID to return ? (Especially in the case of sum() ,when every studentID from a studentCourseID contributes to the aggregate value)

Your might think that MySQL should just automatically return the studentID which has the latest date for a courseID. Unfortunately, it does not work that way. We have to do some extra work. Here are the two steps to do the job:

First of all, we need to get the max date for each studentCourseID:
```
select studentCourseID, max(dateCreated) as maxdate
from derived_table
;
```
Then, we perform a JOIN on our original table and the aggregate table from previous step:
```
select t1.studentID, t1.studentCourseID, t1.dateCreated
from derived_table t1
join (select studentCourseID, max(dateCreated) as maxdate
     from derived_table) t2
on t1.studentCourseID=t2.studentCourseID and t1.dateCreated=t2.maxdate
;
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.