this is my first post on stack overflow, so excuse any ignorance of posting etiquette or whatnot. I was working on this leetcode SQL puzzle, and am very confused as to why these two queries give differing results. They seem near identical to me? Is the presence of customer_pref_delivery_date
in the common-table expression somehow throwing off the first query?
I am using MySQL.
Problematic query:
WITH first_orders AS (
SELECT
customer_id,
MIN(order_date) AS order_date,
customer_pref_delivery_date
FROM Delivery
GROUP BY customer_id
)
SELECT
ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM first_orders
Correct query:
SELECT
ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM Delivery
WHERE (customer_id, order_date) IN (
SELECT
customer_id,
min(order_date)
FROM Delivery
GROUP BY customer_id
)
2
Answers
I’d expect the first query to be far more efficient.
Both samples have this basic inner query to find the first order per customer:
The first sample also includes the
customer_pref_delivery_date
at this level. Technically, you should probably also group by this column and most other databases would actually force you to do this (MySQL is kind of bad here). But the advantage is it lets you include that value as part of the initial pass through the data, and so the first query is able to effectively do this all at once.The second query looks through the original data and uses the inner/nested query for matching. In effect, it requires an additional pass through the source data, with a matching operating from the subquery at each step.
If it were me, I’d consolidate the first query to one level, as well as fix the grouping and otherwise use more standards-compliant options:
The first query is wrong because you’re including a column that is neither part of an aggregate function, or part of the group by. This is effectively:
Where
ANY()
does exactly what it says on in the tin, it will retrieve any value from all available rows. So with a really simple sample data set:When you run your query you might expect to get the value for
customer_pref_delivery_date
that corresponds to your order date returned byMIN(order_date)
but that is not guaranteed. An example of this is on db<>fiddle where the result is:So it has taken the correct minimum order from row with id 2, but has taken any value of
customer_pref_delivery_date
and that happens to be from row with id 1.So even though the first order actually does have a preferred delivery date that matches the order date, your query has brought back a mis-match of data and is giving incorrect results.
The best way of sorting this is to use
ROW_NUMBER()
to get your first order, this then gives you access to all columns associated with that order, e.g.This returns the first order per customer but ensures the delivery date corresponds with the first order:
You can then extend this to include your aggregation of orders:
Example on db<>fiddle