I am trying to understand why a column that I expect to hold nondetermistic values after a GROUP BY
actually contains the correct values, despite the fact that the GROUP BY
columns are not sufficient to identify a row.
Please consider the following query, which retrieves the oldest price for each brand-seller-product combination:
SELECT
price, brand, seller, product, min(date) AS date
FROM
priceHistory
GROUP BY
brand, seller, product;
At first, I was expecting this query to return a nondeterministic value for price
. Instead of that, I invariably get the expected price
value for the returned date
.
So, I looked for an explanation in MySQL’s documentation about GROUP BY
. There, I found out about functional dependences, and learned that MySQL is able to determine the correct values for the SELECT
columns that are not mentionned in the GROUP BY
clause, when the GROUP BY
column (or set of columns) is a PRIMARY KEY
or a UNIQUE KEY
.
Now, I do have a UNIQUE KEY
declared in my table:
UNIQUE KEY `uniquePrice` (`brand`,`seller`,`product`,`date`),
This UNIQUE KEY
has one column (date
) that is not included in the GROUP BY
columns. However, the min(date) AS date
expression in the SELECT
clause does return a unique date
value.
So, my guess is that MySQL combines the brand
, seller
, and product
columns from the GROUP BY
clause on one hand, and the unique date
value from the SELECT
clause on the other hand, to "reconstruct" the UNIQUE KEY
. At which point it is able to establish functional dependence, and identify the row from which the price
value should be retrieved.
Is my assumption correct?
If so, could you show me how I could have verified it on my own?
If not, do you have an alternative explanation?
2
Answers
After processing all the comments to my question and doing a little more investigation, I would say that values from the SELECT clause are not used to establish functional dependence. At the very least, I can say for sure that it's not what's happening in my case.
My assumption about functional dependence being established was incorrect. It can be demonstrated very easily by substituting
max
formin
in the query. The result rows then contain the most recentdate
values, as expected, but still have the oldestprice
for each brand-seller-product group.It turns out that my query was indeed returning the expected
price
s by coincidence.For the sake of completeness, and as explained by @ysth here, a nice, simple way to guarantee that the
price
anddate
values are retrieved from the same row would be:Using a Window function partitioned what you have GROUP BY to be allows the elements of
price
anddate
to match. By ordering these indate ASC
means the first row of each windows is the one you want.ref: other answer