skip to Main Content

I am trying to understand why a column that I expect to hold nondetermistic values after a GROUP BY actually contains the correct values, despite the fact that the GROUP BY columns are not sufficient to identify a row.

Please consider the following query, which retrieves the oldest price for each brand-seller-product combination:

SELECT
    price, brand, seller, product, min(date) AS date
FROM
    priceHistory
GROUP BY
    brand, seller, product;

At first, I was expecting this query to return a nondeterministic value for price. Instead of that, I invariably get the expected price value for the returned date.

So, I looked for an explanation in MySQL’s documentation about GROUP BY. There, I found out about functional dependences, and learned that MySQL is able to determine the correct values for the SELECT columns that are not mentionned in the GROUP BY clause, when the GROUP BY column (or set of columns) is a PRIMARY KEY or a UNIQUE KEY.

Now, I do have a UNIQUE KEY declared in my table:

UNIQUE KEY `uniquePrice` (`brand`,`seller`,`product`,`date`),

This UNIQUE KEY has one column (date) that is not included in the GROUP BY columns. However, the min(date) AS date expression in the SELECT clause does return a unique date value.

So, my guess is that MySQL combines the brand, seller, and product columns from the GROUP BY clause on one hand, and the unique date value from the SELECT clause on the other hand, to "reconstruct" the UNIQUE KEY. At which point it is able to establish functional dependence, and identify the row from which the price value should be retrieved.

Is my assumption correct?

If so, could you show me how I could have verified it on my own?

If not, do you have an alternative explanation?

2

Answers


  1. Chosen as BEST ANSWER

    After processing all the comments to my question and doing a little more investigation, I would say that values from the SELECT clause are not used to establish functional dependence. At the very least, I can say for sure that it's not what's happening in my case.

    My assumption about functional dependence being established was incorrect. It can be demonstrated very easily by substituting max for min in the query. The result rows then contain the most recent date values, as expected, but still have the oldest price for each brand-seller-product group.

    It turns out that my query was indeed returning the expected prices by coincidence.

    For the sake of completeness, and as explained by @ysth here, a nice, simple way to guarantee that the price and date values are retrieved from the same row would be:

    SELECT
        SUBSTRING(MIN(CONCAT(date, price)) FROM 20) AS price, brand, seller, product, MIN(date) AS date
    FROM
        priceHistory
    GROUP BY
        brand, seller, product;
    

  2. Using a Window function partitioned what you have GROUP BY to be allows the elements of price and date to match. By ordering these in date ASC means the first row of each windows is the one you want.

    WITH summary AS
      (SELECT p.price,
              p.brand,
              p.seller,
              p.product,
              p.`date` AS date ROW_NUMBER() OVER(PARTITION BY p.brand, p.seller, p.product
                                                 ORDER BY p.`date` ASC) AS rank
       FROM priceHistory p)
    SELECT *
    FROM summary
    WHERE rank = 1
    

    ref: other answer

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search