skip to Main Content

In my PostgreSQL 14.8 database I have a table named orders that looks as follows:

CREATE TABLE orders (
  user_id int
, order_id int
, order_date date
, quantity int
, revenue float
, product text
);

INSERT INTO orders VALUES
(1, 1, '2021-03-05', 1, 15, 'books'),
(1, 2, '2022-03-07', 1, 3, 'music'),
(1, 3, '2022-06-15', 1, 900, 'travel'),
(1, 4, '2021-11-17', 2, 25, 'books'),
(2, 5, '2022-08-03', 2, 32, 'books'),
(2, 6, '2021-04-12', 2, 4, 'music'),
(2, 7, '2021-06-29', 3, 9, 'books'),
(2, 8, '2022-11-03', 1, 8, 'music'),
(3, 9, '2022-11-07', 1, 575, 'food'),
(3, 10, '2022-11-20', 2, 95, 'food'),
(3, 11, '2022-11-20', 1, 95, 'food'),
(4, 12, '2022-11-20', 2, 95, 'books'),
(4, 13, '2022-11-21', 1, 95, 'food'),
(4, 14, '2022-11-23', 4, 17, 'books'),
(5, 15, '2022-11-20', 1, 95, 'food'),
(5, 16, '2022-11-25', 2, 95, 'books'),
(5, 17, '2022-11-29', 1, 95, 'food');

See SQL fiddle: http://sqlfiddle.com/#!17/3dc69/1

I would like to get the following:

Among customers who first purchased a book (first according to order_date) as a product (in this case, customer IDs 1 and 4), I would like to get:

A) the average quantity of books that this cohort purchased (in this case, 2.25, which is the average of 1, 2, 2, and 4), and;

B) the total revenue of each of these purchases (in this case, 152).

I tried to use several CTEs to accomplish this. Here is my attempt:

WITH all_orders AS (
  SELECT
    *
    , ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date) AS row_num
  FROM
    orders o
  ),

  first_product_is_books AS (
  SELECT
      DISTINCT(user_id) AS usr_id
  FROM
    all_orders
  WHERE product = 'books' AND row_num = 1
  ),

  temp_results AS (
  SELECT
    *
  FROM
    all_orders ao
  JOIN
    first_product_is_books AS fp
  ON ao.user_id = fp.usr_id
  )

  SELECT
    avg(quantity)
    , sum(revenue)
  FROM
    temp_results tr
  WHERE
     tr.product = 'books'

See fiddle: http://sqlfiddle.com/#!17/3dc69/1

This works on a toy dataset, but not on the production dataset, where the table is about 400,000 records. I think this query is hacky and not really optimized. Is there a more efficient way?

3

Answers


  1. How about this?

    with order_cte as 
    (
      SELECT
      *,
      ROW_NUMBER() OVER(partition by user_id order by order_date) as row_num
      from orders
     )
     
     SELECT
     AVG(QUANTITY)AVG_QUANTITY
     , SUM(REVENUE)TOTAL_REV
     FROM ORDERS
     WHERE PRODUCT = 'books'
       and USER_ID IN (
                       SELECT
                        USER_ID
                         FROM ORDER_CTE
                        WHERE PRODUCT = 'books'
                          AND ROW_NUM = 1
                     )
    
    Login or Signup to reply.
  2. This version of your query is much easier to read.

    WITH first_orders AS (
        SELECT
            DISTINCT
            user_id, 
            FIRST_VALUE(product) OVER (PARTITION BY user_id ORDER BY order_date) as first_product
        FROM orders
        GROUP BY user_id, order_date, product
    )
    
    SELECT 
        AVG(o.quantity) AS avg_quantity, 
        SUM(o.revenue) AS total_revenue
    FROM orders o
    JOIN first_orders fo
    ON o.user_id = fo.user_id
    WHERE fo.first_product = 'books' AND o.product = 'books';
    
    Login or Signup to reply.
  3. Simpler with DISTINCT ON:

    SELECT avg(o.quantity) AS avg_quantity
         , sum(o.revenue)  AS total_revenue
    FROM  (
       SELECT DISTINCT ON (user_id)
              user_id, product
       FROM   orders
       ORDER  BY user_id, order_date
       ) init
    JOIN   orders o USING (user_id, product)
    WHERE  init.product = 'books';
    

    fiddle

    Which query style is the most efficient primarily depends on cardinalities. How many rows total, how many users, how many orders for books, how many initial orders with books, avg row size, avg size of product.

    Can be simpler and faster if there is also a table users holding one row per relevant user_id.

    And you need an index to support the query. Ideally, on (user_id, order_date, product). Again, the best index depends on above mentioned undisclosed details.

    And (user_id, order_date) must be unique, or there is ambiguity.

    See:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search