skip to Main Content

I am learning SQL and have created the following dataset in MySQL:

Create table Departments (DepartmentID int primary key, Name text);

insert into Departments values
(1001,'SRO'),
(2001,'Drs'),
(3001,'Accounting');

Create table Employees 
(
  EmployeeID int not null, 
  DepartmentID int not null,
  ManagerID int,
  Name varchar(50) not null,
  Salary int not null,
  primary key(EmployeeID),
  foreign key (DepartmentID)
    references Departments(DepartmentID)
 );

insert into Employees values
(68319,1001,NULL,'Kayling','6000.00'),
(66928,3001,68319,'Blaze','2750.00'),
(67832,1001,68319,'Clare','2550.00'),
(65646,2001,68319,'Jonas','2957.00'),
(67858,2001,65646,'Scarlet','3100.00'),
(69062,2001,65646,'Frank','3100.00'),
(63679,2001,69062,'Sandrine','900.00'),
(64989,3001,66928,'Adelyn','1700.00'),
(65271,3001,66928,'Wade','1350.00');

I want to find the maximum salary for each department, researching many of the articles I saw suggested running MAX in this form:

SELECT max(Salary),DepartmentID 
FROM Employees
GROUP BY DepartmentID;

However, given the order of operations as described in the answer here

FROM, including JOINs
WHERE
SELECT the row  obtained  by from and where in a temporary area for others 
        operation (and build the column alias)
DISTINCT
GROUP BY
HAVING
ORDER BY
LIMIT and OFFSET
return the final result

The MAX function is called before the GROUP BY, which means the MAX will work across the whole table before the GROUP BY has even been performed, which from my understanding should result in the presence of a single maximum value across the table which should be 6000. However, that’s not what happens, the results I do see are confusingly in the right structure that I want:

max(Salary) DepartmentID
6000        1001
3100        2001
2750        3001

somehow the MAX function works after a GROUP BY but also before losing all the rows due to the GROUP BY dropping multiple group rows.

If I go ahead with this understanding it throws up more questions for me.

If a GROUP BY operation occurs before the SELECT, then the GROUP BY will remove all the rows for each DepartmentID and leave only a single row randomly chosen, which means by the time the MAX function in SELECT has a chance to even run it will only see one salary value which could be any value rather than the maximum for each Department.

What am I missing that is complicating my understanding?

2

Answers


  1. Aggregate functions like MAX (or SUM, MIN, COUNT etc) are applied to the sets created by the GROUP BY clause.

    In your example the salaries for each department are divided into separate sets and the MAX-function is applied to each set giving the maximun salary for each department.

    Login or Signup to reply.
  2. The thing we need to understand when we evaluate a query is that there is a projection before the group by and there is a projection after the group by.

    Before the group by we define what relation we are to group from. This may be an entire table, a projected table (the columns are different than initially defined), a filtered table (where clause and joins) or even an ad-hoc table (like (select 1 as foo union select 2 as foo union select 3 as foo)).

    GROUP BY is the process of aggregating the results. This means that things like MAX() or MIN(), AVG() by default (notwithstanding indexes) is not evaluatable before the records that contribute to the result were not evaluated fully.

    So, how would you evaluate the MAX value of something, for example:

    (pseudo-code)

    max <- -infinity
    for r of rows do
        if max < r.yourfield then max <- r.yourfield
    end for
    

    and this is what happens in your case. MAX(yourfield) marks that for this field you need to find out what the maximum is for the result set. And if there was a GROUP BY, then the pseudo-code changes to something like this:

    max <- -infinity
    for r of rows do
        if max[group(row)] < r.yourfield then max[group(row)] <- r.yourfield
    end for
    

    So, the records are ungrouped and are being processed. As the groups are being built, the separate aggregated values are being built up separately.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search