Postgresql - Efficient way to return counts and averages without joining together multiple CTEs

zeroes_ones
September 29, 2023
137 views
3 votes
3 Answers

In my PostgreSQL 14.8 database I have a table named orders that looks as follows:

CREATE TABLE orders (
  user_id int
, order_id int
, order_date date
, quantity int
, revenue float
, product text
);

INSERT INTO orders VALUES
(1, 1, '2021-03-05', 1, 15, 'books'),
(1, 2, '2022-03-07', 1, 3, 'music'),
(1, 3, '2022-06-15', 1, 900, 'travel'),
(1, 4, '2021-11-17', 2, 25, 'books'),
(2, 5, '2022-08-03', 2, 32, 'books'),
(2, 6, '2021-04-12', 2, 4, 'music'),
(2, 7, '2021-06-29', 3, 9, 'books'),
(2, 8, '2022-11-03', 1, 8, 'music'),
(3, 9, '2022-11-07', 1, 575, 'food'),
(3, 10, '2022-11-20', 2, 95, 'food'),
(3, 11, '2022-11-20', 1, 95, 'food'),
(4, 12, '2022-11-20', 2, 95, 'books'),
(4, 13, '2022-11-21', 1, 95, 'food'),
(4, 14, '2022-11-23', 4, 17, 'books'),
(5, 15, '2022-11-20', 1, 95, 'food'),
(5, 16, '2022-11-25', 2, 95, 'books'),
(5, 17, '2022-11-29', 1, 95, 'food');

See SQL fiddle: http://sqlfiddle.com/#!17/3dc69/1

I would like to get the following:

Among customers who first purchased a book (first according to order_date) as a product (in this case, customer IDs 1 and 4), I would like to get:

A) the average quantity of books that this cohort purchased (in this case, 2.25, which is the average of 1, 2, 2, and 4), and;

B) the total revenue of each of these purchases (in this case, 152).

I tried to use several CTEs to accomplish this. Here is my attempt:

WITH all_orders AS (
  SELECT
    *
    , ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date) AS row_num
  FROM
    orders o
  ),

  first_product_is_books AS (
  SELECT
      DISTINCT(user_id) AS usr_id
  FROM
    all_orders
  WHERE product = 'books' AND row_num = 1
  ),

  temp_results AS (
  SELECT
    *
  FROM
    all_orders ao
  JOIN
    first_product_is_books AS fp
  ON ao.user_id = fp.usr_id
  )

  SELECT
    avg(quantity)
    , sum(revenue)
  FROM
    temp_results tr
  WHERE
     tr.product = 'books'

See fiddle: http://sqlfiddle.com/#!17/3dc69/1

This works on a toy dataset, but not on the production dataset, where the table is about 400,000 records. I think this query is hacky and not really optimized. Is there a more efficient way?

Answers

How about this?

with order_cte as 
(
  SELECT
  *,
  ROW_NUMBER() OVER(partition by user_id order by order_date) as row_num
  from orders
 )
 
 SELECT
 AVG(QUANTITY)AVG_QUANTITY
 , SUM(REVENUE)TOTAL_REV
 FROM ORDERS
 WHERE PRODUCT = 'books'
   and USER_ID IN (
                   SELECT
                    USER_ID
                     FROM ORDER_CTE
                    WHERE PRODUCT = 'books'
                      AND ROW_NUM = 1
                 )

This version of your query is much easier to read.

WITH first_orders AS (
    SELECT
        DISTINCT
        user_id, 
        FIRST_VALUE(product) OVER (PARTITION BY user_id ORDER BY order_date) as first_product
    FROM orders
    GROUP BY user_id, order_date, product
)

SELECT 
    AVG(o.quantity) AS avg_quantity, 
    SUM(o.revenue) AS total_revenue
FROM orders o
JOIN first_orders fo
ON o.user_id = fo.user_id
WHERE fo.first_product = 'books' AND o.product = 'books';

- ErwinBrandstetter
- September 29, 2023 at 3:38 am
- 0 votes
0
Simpler with DISTINCT ON:
```
SELECT avg(o.quantity) AS avg_quantity
     , sum(o.revenue)  AS total_revenue
FROM  (
   SELECT DISTINCT ON (user_id)
          user_id, product
   FROM   orders
   ORDER  BY user_id, order_date
   ) init
JOIN   orders o USING (user_id, product)
WHERE  init.product = 'books';
```
fiddle

Which query style is the most efficient primarily depends on cardinalities. How many rows total, how many users, how many orders for books, how many initial orders with books, avg row size, avg size of product.

Can be simpler and faster if there is also a table users holding one row per relevant user_id.

And you need an index to support the query. Ideally, on (user_id, order_date, product). Again, the best index depends on above mentioned undisclosed details.

And (user_id, order_date) must be unique, or there is ambiguity.

See:
- Select first row in each GROUP BY group?
- Optimize GROUP BY query to retrieve latest row per user
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Efficient way to return counts and averages without joining together multiple CTEs

Answers