skip to Main Content

this is my first post on stack overflow, so excuse any ignorance of posting etiquette or whatnot. I was working on this leetcode SQL puzzle, and am very confused as to why these two queries give differing results. They seem near identical to me? Is the presence of customer_pref_delivery_date in the common-table expression somehow throwing off the first query?

I am using MySQL.

Problematic query:

WITH first_orders AS (
    SELECT
        customer_id,
        MIN(order_date) AS order_date,
        customer_pref_delivery_date
    FROM Delivery
    GROUP BY customer_id
)
SELECT 
    ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM first_orders

Correct query:

SELECT
    ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
FROM Delivery
WHERE (customer_id, order_date) IN (
    SELECT 
        customer_id,
        min(order_date)
    FROM Delivery
    GROUP BY customer_id
)

2

Answers


  1. I’d expect the first query to be far more efficient.

    Both samples have this basic inner query to find the first order per customer:

    SELECT customer_id,
        min(order_date)
    FROM Delivery
    GROUP BY customer_id
    

    The first sample also includes the customer_pref_delivery_date at this level. Technically, you should probably also group by this column and most other databases would actually force you to do this (MySQL is kind of bad here). But the advantage is it lets you include that value as part of the initial pass through the data, and so the first query is able to effectively do this all at once.

    The second query looks through the original data and uses the inner/nested query for matching. In effect, it requires an additional pass through the source data, with a matching operating from the subquery at each step.


    If it were me, I’d consolidate the first query to one level, as well as fix the grouping and otherwise use more standards-compliant options:

    SELECT
        customer_id,
        ROUND(SUM(case when min(order_date) = customer_pref_delivery_date then 1 else 0 end) / COUNT(*) * 100, 2) AS immediate_percentage
    FROM Delivery
    GROUP BY customer_id, customer_pref_delivery_date
    
    Login or Signup to reply.
  2. The first query is wrong because you’re including a column that is neither part of an aggregate function, or part of the group by. This is effectively:

    SELECT
        customer_id,
        MIN(order_date) AS order_date,
        ANY(customer_pref_delivery_date) AS customer_pref_delivery_date
    FROM Delivery
    GROUP BY customer_id;
    

    Where ANY() does exactly what it says on in the tin, it will retrieve any value from all available rows. So with a really simple sample data set:

    id customer_id order_date customer_pref_delivery_date
    1 1 2024-02-02 2024-02-05
    2 1 2024-02-01 2024-02-01

    When you run your query you might expect to get the value for customer_pref_delivery_date that corresponds to your order date returned by MIN(order_date) but that is not guaranteed. An example of this is on db<>fiddle where the result is:

    customer_id order_date customer_pref_delivery_date
    1 2024-02-01 2024-02-05

    So it has taken the correct minimum order from row with id 2, but has taken any value of customer_pref_delivery_date and that happens to be from row with id 1.

    So even though the first order actually does have a preferred delivery date that matches the order date, your query has brought back a mis-match of data and is giving incorrect results.

    The best way of sorting this is to use ROW_NUMBER() to get your first order, this then gives you access to all columns associated with that order, e.g.

    WITH RankedOrders AS
    (   SELECT *, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date) AS RowNum
        FROM Delivery
    )
    SELECT id, customer_id, order_date, customer_pref_delivery_date
    FROM RankedOrders AS rd
    WHERE RowNum = 1;
    

    This returns the first order per customer but ensures the delivery date corresponds with the first order:

    id customer_id order_date customer_pref_delivery_date
    2 1 2024-02-01 2024-02-01

    You can then extend this to include your aggregation of orders:

    WITH RankedOrders AS
    (   SELECT *, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date) AS RowNum
        FROM Delivery
    )
    SELECT ROUND(SUM(IF(order_date = customer_pref_delivery_date, 1, 0)) / COUNT(*) * 100, 2) AS immediate_percentage
    FROM RankedOrders AS rd
    WHERE RowNum = 1;
    

    Example on db<>fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search