I’m working on MySQL workbench and I would like to find the top-selling products by types of buyers. First I categorized customers based on their buying activity. If a person made a purchase only on 1/1, it’s a first-time buyer. If a person made a purchase on 1/1 and 1/2, it’s a second-time buyer. If a person made a purchase on 1/1, 1/2, and 1/3, it’s a third-time buyer. And so on.
The following is sample data, similar to my original data.
user_id | PurchaseDate | Product | Total | order_id |
---|---|---|---|---|
1 | 2023-01-01 16:46 | yogurt | 3.72 | 1000 |
1 | 2023-01-01 18:23 | milk | 4.38 | null |
2 | 2023-01-01 12:34 | egg | 4.57 | 1001 |
1 | 2023-01-01 11:54 | butter | 3.69 | null |
3 | 2023-01-01 10:41 | cheese | 3.68 | 1003 |
2 | 2023-01-01 13:45 | milk | 4.38 | null |
3 | 2023-01-03 18:56 | yogurt | 3.72 | 1005 |
3 | 2023-01-02 19:12 | cream | 4.98 | 1004 |
2 | 2023-01-02 12:47 | egg | 4.57 | 1002 |
If a customer makes more than one purchase on the same date, the order_id is null.
So, I used that to categorize the buyers.
SELECT user_id, product, sum(Total) as Total,
CASE WHEN Count(DISTINCT(order_id)) = 1 THEN "FirstTimeBuyer"
WHEN Count(DISTINCT(order_id)) = 2 THEN "SecondTimeBuyer"
WHEN Count(DISTINCT(order_id)) = 3 THEN "ThirdTimeBuyer"
END AS TypeofBuyer
FROM myData
GROUP BY user_id
ORDER BY Total DESC;
And I get the following result
user_id | Product | Total | TypeofBuyer |
---|---|---|---|
2 | egg | 13.52 | SecondTimeBuyer |
3 | cheese | 12.38 | ThirdTimeBuyer |
1 | yogurt | 11.79 | FirstTimeBuyer |
But the problem is since I aggregated the total and grouped it by user_id, I cannot find the top-selling product for each user_id.
For instance, for user_id 1, the top-selling product is milk and the customer is a first-time buyer.
How can I extract the above information?
I’d appreciate your help!
Edit: Eventually, I’m aiming to visualize the top 10 selling products for 1st-time buyers, 2nd-time buyers, and so forth (my original data is much bigger than the sample)
4
Answers
You can use a CTE.
Output:
Try this:
Using
ROW_NUMBER()
to get top sales and assign with row numbering andSUM() OVER ()
to get total sales byuser_id
in the firstcte
. Then re-using your method of gettingTypeofBuyer
in the secondcte
. Lastly, a query to join both of thectes
and filter only rows with row number=1 to get your desired result.Demo fiddle
Since you didn’t specify which mysql version you are using, it’s possible you’re using a sub-8.0 one, in which window functions such as CTE are not supported. Therefore, I’m taking a version-neutral approach.
First of all, let’s get the top-selling product for each user_id.
The rest is to join the previous query to your query.
Your current CASE operation is doing more work than it needs to:
Internally, this will require a separate temporary table for each
COUNT(DISTINCT)
, even though it is essentially comparing the same value. Rewriting it like:requires only one temporary table to evaluate the three cases. It is not a significant overhead, but why do more work than is necessary?
The "removal" of the
order_id
s, for subsequent orders on the same day, breaks your relational model and I do not understand the intent. If it is just for the sake of calculating yourTypeofBuyer
, then perhapsCOUNT(DISTINCT DATE(PurchaseDate))
is a better option.Yet another way to retrieve the
top-selling product
is with a correlated subquery. As with many things in the SQL world, there are many ways to achieve the same end goal, and you should test with a real world representative dataset to see what works best with your data:Output:
Here’s a db<>fiddle