Mysql - What makes these two SQL queries different? Stumping me completely

danielyankin
February 19, 2024
284 views
1 vote
2 Answers

This is my first post on Stack Overflow, so excuse any ignorance of posting etiquette. I was working on this LeetCode SQL puzzle and am very confused as to why these two queries give different results. They seem nearly identical to me. Is the presence of customer_pref_delivery_date in the common table expression somehow throwing off the first query?

I am using MySQL.

Problematic query:

```
WITH first_orders AS (
    SELECT
        customer_id,
        MIN(order_date) AS order_date,
        customer_pref_delivery_date
    FROM Delivery
    GROUP BY customer_id
)
SELECT 
    ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM first_orders
```

Correct query:

```
SELECT
    ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM Delivery
WHERE (customer_id, order_date) IN (
    SELECT 
        customer_id,
        MIN(order_date)
    FROM Delivery
    GROUP BY customer_id
)
```

Tags: mysql sql

Answers

- JoelCoehoorn
- February 19, 2024 at 6:41 pm
- 0 votes
I’d expect the first query to be far more efficient.

Both samples have this basic inner query to find the first order per customer:
```
SELECT customer_id,
    min(order_date)
FROM Delivery
GROUP BY customer_id
```
The first sample also includes customer_pref_delivery_date at this level. Technically, you should probably also group by this column, and most other databases would force you to do so (MySQL only lets it through when the ONLY_FULL_GROUP_BY SQL mode is disabled). But the advantage is that it lets you include that value as part of the initial pass through the data, so the first query is effectively able to do this all at once.

The second query looks through the original data and uses the nested query for matching. In effect, it requires an additional pass through the source data, with a matching operation against the subquery at each step.

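If it were me, I’d keep this to a single read of the table, as well as fix the grouping and otherwise use more standards-compliant options. Aggregates cannot be nested (SUM() around MIN() is rejected), so this version finds the first order date with a window function, which needs MySQL 8.0 or later:
```
SELECT
    ROUND(SUM(CASE WHEN order_date = customer_pref_delivery_date THEN 1 ELSE 0 END) / COUNT(*) * 100, 2) AS immediate_percentage
FROM (
    SELECT order_date, customer_pref_delivery_date,
        MIN(order_date) OVER (PARTITION BY customer_id) AS first_order_date
    FROM Delivery
) AS d
WHERE order_date = first_order_date
```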

- GarethD
- February 19, 2024 at 7:10 pm
- 0 votes
The first query is wrong because you’re including a column that is neither part of an aggregate function nor part of the GROUP BY. This is effectively:
```
SELECT
    customer_id,
    MIN(order_date) AS order_date,
    ANY_VALUE(customer_pref_delivery_date) AS customer_pref_delivery_date
FROM Delivery
GROUP BY customer_id;
```
ANY_VALUE() does exactly what it says on the tin: it retrieves any value from all available rows in the group. So with a really simple sample data set:

| id | customer_id | order_date | customer_pref_delivery_date |
|----|-------------|------------|-----------------------------|
| 1 | 1 | 2024-02-02 | 2024-02-05 |
| 2 | 1 | 2024-02-01 | 2024-02-01 |

When you run your query you might expect to get the value for customer_pref_delivery_date that corresponds to your order date returned by MIN(order_date) but that is not guaranteed. An example of this is on db<>fiddle where the result is:

| customer_id | order_date | customer_pref_delivery_date |
|-------------|------------|-----------------------------|
| 1 | 2024-02-01 | 2024-02-05 |

So it has taken the correct minimum order date from the row with id 2, but has taken any value of customer_pref_delivery_date, and that happens to be from the row with id 1.

So even though the first order actually does have a preferred delivery date that matches the order date, your query has brought back mismatched data and is giving incorrect results.
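To make the effect on the final percentage concrete, here is a rough Python model of the two behaviours using the two sample rows above. It does not run any SQL; the helper `immediate_percentage` and the enumeration of "any value" picks are purely illustrative assumptions about how an arbitrary choice could play out, not a description of how MySQL chooses internally.

```python
from itertools import product

# Sample rows: (id, customer_id, order_date, customer_pref_delivery_date)
rows = [
    (1, 1, "2024-02-02", "2024-02-05"),
    (2, 1, "2024-02-01", "2024-02-01"),
]

def immediate_percentage(first_orders):
    """Percentage of (order_date, pref_date) pairs where the two dates match."""
    hits = sum(1 for order_date, pref in first_orders if order_date == pref)
    return round(hits / len(first_orders) * 100, 2)

# Group (order_date, pref_date) pairs by customer.
groups = {}
for _id, customer_id, order_date, pref in rows:
    groups.setdefault(customer_id, []).append((order_date, pref))

# Model of the problematic query: MIN(order_date) is well defined, but the
# non-aggregated column may be taken from ANY row of the customer's group.
per_customer_choices = [
    [(min(od for od, _ in g), pref) for _, pref in g]
    for g in groups.values()
]
possible_results = {
    immediate_percentage(list(choice))
    for choice in product(*per_customer_choices)
}

# Model of the correct query: both columns come from the earliest row.
# ISO-formatted date strings sort chronologically, so min() on them is safe.
correct = immediate_percentage(
    [min(g, key=lambda pair: pair[0]) for g in groups.values()]
)

print(sorted(possible_results), correct)
```

With this data the problematic form can land on either 0.0 or 100.0 depending on which row supplies the preferred date, while the row-consistent form always gives 100.0.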

The best way to fix this is to use ROW_NUMBER() (available from MySQL 8.0) to get your first order. This gives you access to all columns associated with that order, e.g.
```
WITH RankedOrders AS
(   SELECT *, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date) AS RowNum
    FROM Delivery
)
SELECT id, customer_id, order_date, customer_pref_delivery_date
FROM RankedOrders AS rd
WHERE RowNum = 1;
```
This returns the first order per customer but ensures the delivery date corresponds with the first order:

| id | customer_id | order_date | customer_pref_delivery_date |
|----|-------------|------------|-----------------------------|
| 2 | 1 | 2024-02-01 | 2024-02-01 |

You can then extend this to include your aggregation of orders:
```
WITH RankedOrders AS
(   SELECT *, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date) AS RowNum
    FROM Delivery
)
SELECT ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM RankedOrders AS rd
WHERE RowNum = 1;
```
Example on db<>fiddle
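If you want to try this without a MySQL server, a small self-contained sketch with Python's built-in sqlite3 module may help. Several things here are assumptions rather than part of the original queries: SQLite stands in for MySQL, the bundled SQLite must be 3.25 or newer for window functions, CASE replaces MySQL's IF(), `100.0 *` is used because SQLite would otherwise do integer division, and the third row (customer 2) is made-up extra data. It runs the ROW_NUMBER() query next to the `(customer_id, order_date) IN (...)` query from the question to check that they agree.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Delivery ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, customer_pref_delivery_date TEXT)"
)
conn.executemany(
    "INSERT INTO Delivery VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2024-02-02", "2024-02-05"),
        (2, 1, "2024-02-01", "2024-02-01"),
        (3, 2, "2024-02-03", "2024-02-04"),  # hypothetical extra customer
    ],
)

# Shared percentage expression (CASE instead of IF, 100.0 to avoid integer division).
PCT = (
    "ROUND(100.0 * SUM(CASE WHEN order_date = customer_pref_delivery_date "
    "THEN 1 ELSE 0 END) / COUNT(*), 2)"
)

# First order per customer via ROW_NUMBER(), as in the answer above.
ranked_sql = f"""
WITH RankedOrders AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS RowNum
    FROM Delivery
)
SELECT {PCT} FROM RankedOrders WHERE RowNum = 1
"""

# First order per customer via the row-value IN subquery from the question.
in_sql = f"""
SELECT {PCT} FROM Delivery
WHERE (customer_id, order_date) IN (
    SELECT customer_id, MIN(order_date) FROM Delivery GROUP BY customer_id
)
"""

ranked_pct = conn.execute(ranked_sql).fetchone()[0]
in_pct = conn.execute(in_sql).fetchone()[0]
print(ranked_pct, in_pct)
```

With these three rows, customer 1's first order is immediate and customer 2's is not, so both queries should report 50.0. Note that SQLite handles bare columns next to MIN() differently from MySQL, so it is not a good place to reproduce the original problem itself.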