In my PostgreSQL 14.8 database I have a table named orders
that looks as follows:
CREATE TABLE orders (
user_id int
, order_id int
, order_date date
, quantity int
, revenue float
, product text
);
INSERT INTO orders VALUES
(1, 1, '2021-03-05', 1, 15, 'books'),
(1, 2, '2022-03-07', 1, 3, 'music'),
(1, 3, '2022-06-15', 1, 900, 'travel'),
(1, 4, '2021-11-17', 2, 25, 'books'),
(2, 5, '2022-08-03', 2, 32, 'books'),
(2, 6, '2021-04-12', 2, 4, 'music'),
(2, 7, '2021-06-29', 3, 9, 'books'),
(2, 8, '2022-11-03', 1, 8, 'music'),
(3, 9, '2022-11-07', 1, 575, 'food'),
(3, 10, '2022-11-20', 2, 95, 'food'),
(3, 11, '2022-11-20', 1, 95, 'food'),
(4, 12, '2022-11-20', 2, 95, 'books'),
(4, 13, '2022-11-21', 1, 95, 'food'),
(4, 14, '2022-11-23', 4, 17, 'books'),
(5, 15, '2022-11-20', 1, 95, 'food'),
(5, 16, '2022-11-25', 2, 95, 'books'),
(5, 17, '2022-11-29', 1, 95, 'food');
See SQL fiddle: http://sqlfiddle.com/#!17/3dc69/1
I would like to get the following:
Among customers who first purchased a book (first according to order_date
) as a product
(in this case, customer IDs 1 and 4), I would like to get:
A) the average quantity
of books that this cohort purchased (in this case, 2.25, which is the average of 1, 2, 2, and 4), and;
B) the total revenue
of each of these purchases (in this case, 152).
I tried to use several CTEs to accomplish this. Here is my attempt:
WITH all_orders AS (
SELECT
*
, ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date) AS row_num
FROM
orders o
),
first_product_is_books AS (
SELECT
DISTINCT(user_id) AS usr_id
FROM
all_orders
WHERE product = 'books' AND row_num = 1
),
temp_results AS (
SELECT
*
FROM
all_orders ao
JOIN
first_product_is_books AS fp
ON ao.user_id = fp.usr_id
)
SELECT
avg(quantity)
, sum(revenue)
FROM
temp_results tr
WHERE
tr.product = 'books'
See fiddle: http://sqlfiddle.com/#!17/3dc69/1
This works on a toy dataset, but not on the production dataset, where the table is about 400,000 records. I think this query is hacky and not really optimized. Is there a more efficient way?
3
Answers
How about this?
This version of your query is much easier to read.
Simpler with
DISTINCT ON
:fiddle
Which query style is the most efficient primarily depends on cardinalities. How many rows total, how many users, how many orders for books, how many initial orders with books, avg row size, avg size of
product
.Can be simpler and faster if there is also a table
users
holding one row per relevantuser_id
.And you need an index to support the query. Ideally, on
(user_id, order_date, product)
. Again, the best index depends on above mentioned undisclosed details.And
(user_id, order_date)
must be unique, or there is ambiguity.See: