I wrote two queries and thought they would give the same result, but it didn’t happen. Could you please explain to me why?
1)
SELECT date, COUNT(DISTINCT user_id)
FROM (SELECT user_id, date(time) as date, COUNT(order_id)
FROM user_actions
WHERE order_id NOT IN (SELECT order_id FROM user_actions WHERE action = 'cancel_order')
GROUP BY user_id,date(time)
HAVING COUNT(order_id) = 1)t1
GROUP BY date
2)
SELECT date(time) as date, COUNT(DISTINCT user_id)
FROM user_actions
WHERE user_id IN (SELECT user_id FROM (SELECT user_id, date(time), COUNT(order_id)
FROM user_actions
WHERE order_id NOT IN (SELECT order_id FROM user_actions WHERE action = 'cancel_order')
GROUP BY user_id,date(time)
HAVING COUNT(order_id) = 1)t1)
GROUP BY date(time)
2
Answers
In 1), the subquery t1 is retrieving all the users that had one (and only one) action per day of type other than ‘cancel_action’ and the day of this action, then the top query is counting how many of these users existed each day.
In 2), the subquery t1 is doing exactly the same as in 1), but then you are getting the user_ids and counting users per day of action. The problem is, even if the users exist in the set of users that had at most one action per day at some date, that doesn’t mean that some of these users could have had more than 1 action on the same day in different dates.
Example:
user_id = 1
user_id = 2
t1 returns:
Then, in 1) counting from t1 by date results in:
But in 2) you will get that extra date 02/05/2024 when user_id 2 made two actions, so you will get:
it’s a matter of filtering.
Your first query will only consider users with one order not cancelled per date.
Then count the distinct users per date.
Your second query will retrieve all users with one order not cancelled per date.
Then for each of these users count the distinct users per date.
So, imagine the following:
Your first query will return
Record #2 is excluded.
Your second query will return
As User1 has a legit order on Date1, the order on Date2 is also counted. Record #2 is no more filtered out.