I am learning SQL and have created the following dataset in MySQL:
Create table Departments (DepartmentID int primary key, Name text);
insert into Departments values
(1001,'SRO'),
(2001,'Drs'),
(3001,'Accounting');
Create table Employees
(
EmployeeID int not null,
DepartmentID int not null,
ManagerID int,
Name varchar(50) not null,
Salary int not null,
primary key(EmployeeID),
foreign key (DepartmentID)
references Departments(DepartmentID)
);
insert into Employees values
(68319,1001,NULL,'Kayling','6000.00'),
(66928,3001,68319,'Blaze','2750.00'),
(67832,1001,68319,'Clare','2550.00'),
(65646,2001,68319,'Jonas','2957.00'),
(67858,2001,65646,'Scarlet','3100.00'),
(69062,2001,65646,'Frank','3100.00'),
(63679,2001,69062,'Sandrine','900.00'),
(64989,3001,66928,'Adelyn','1700.00'),
(65271,3001,66928,'Wade','1350.00');
I want to find the maximum salary for each department, researching many of the articles I saw suggested running MAX
in this form:
SELECT max(Salary),DepartmentID
FROM Employees
GROUP BY DepartmentID;
However, given the order of operations as described in the answer here
FROM, including JOINs
WHERE
SELECT the row obtained by from and where in a temporary area for others
operation (and build the column alias)
DISTINCT
GROUP BY
HAVING
ORDER BY
LIMIT and OFFSET
return the final result
The MAX
function is called before the GROUP BY
, which means the MAX
will work across the whole table before the GROUP BY
has even been performed, which from my understanding should result in the presence of a single maximum value across the table which should be 6000. However, that’s not what happens, the results I do see are confusingly in the right structure that I want:
max(Salary) DepartmentID
6000 1001
3100 2001
2750 3001
somehow the MAX
function works after a GROUP BY
but also before losing all the rows due to the GROUP BY
dropping multiple group rows.
If I go ahead with this understanding it throws up more questions for me.
If a GROUP BY
operation occurs before the SELECT
, then the GROUP BY
will remove all the rows for each DepartmentID
and leave only a single row randomly chosen, which means by the time the MAX
function in SELECT
has a chance to even run it will only see one salary value which could be any value rather than the maximum for each Department.
What am I missing that is complicating my understanding?
2
Answers
Aggregate functions like MAX (or SUM, MIN, COUNT etc) are applied to the sets created by the
GROUP BY
clause.In your example the salaries for each department are divided into separate sets and the MAX-function is applied to each set giving the maximun salary for each department.
The thing we need to understand when we evaluate a query is that there is a projection before the
group by
and there is a projection after thegroup by
.Before the
group by
we define what relation we are to group from. This may be an entire table, a projected table (the columns are different than initially defined), a filtered table (where
clause and joins) or even an ad-hoc table (like (select 1 as foo union select 2 as foo union select 3 as foo
)).GROUP BY
is the process of aggregating the results. This means that things likeMAX()
orMIN()
,AVG()
by default (notwithstanding indexes) is not evaluatable before the records that contribute to the result were not evaluated fully.So, how would you evaluate the MAX value of something, for example:
(pseudo-code)
and this is what happens in your case.
MAX(yourfield)
marks that for this field you need to find out what the maximum is for the result set. And if there was aGROUP BY
, then the pseudo-code changes to something like this:So, the records are ungrouped and are being processed. As the groups are being built, the separate aggregated values are being built up separately.