skip to Main Content

I have a dataset with ecommerce events:

create table sample_events(
event_date date,
"session" varchar,
"user" varchar,
page_type varchar,
event_type varchar,
product int8);

INSERT INTO sample_events (event_date,"session","user",page_type,event_type,product) values
('2022-10-01','session1user1','user1','product_page','page_view',0),
('2022-10-01','session1user2','user2','listing_page','page_view',0),
('2022-10-01','session1user2','user2','search_listing_page','page_view',0),
('2022-10-01','session1user3','user3','product_page','page_view',0),
('2022-10-01','session2user1','user1','product_page','add_to_cart',20969597),
('2022-10-02','session2user1','user1','order_page','order',0),
('2022-10-02','session2user3','user3','product_page','add_to_cart', 34856927),
('2022-10-02','session3user3','user3','product_page','add_to_cart', 19848603),
('2022-10-04','session4user3','user3','order_page','order',0);

What I need is to return in a single query a number of customers by day that:

  • only viewed products in their first session;
  • added only one product to the cart;
  • placed an order within two days after the first session.

For the first I tried:

with 
cte1 as(
    select
        *,
        first_value("session") over(partition by event_date, "user" order by event_date) as day_first_session,
        case 
            when page_type = 'product_page' and event_type = 'page_view' then 1 else 0
        end as is_product_view
    from sample_events se
)
select
    event_date,
    count(distinct "user")
from (
    select
    *,
    min(is_product_view) over(partition by day_first_session) as min_indicator,
    max(is_product_view) over(partition by day_first_session) as max_indicator
    from cte1)t
where min_indicator=max_indicator and max_indicator = 1
group by 1

It works but don’t think it is the most efficient one. And with that approach we can return all the three requirements only by joining tables by date and I desire to avoid joins in the query. Is there any way to compose the query?

2

Answers


  1. Chosen as BEST ANSWER

    Well, it seems there is no other way rather then join tables by dates. But for now I guess it wouldn't be slow/expensive because of even for ten years they would have around 10K rows only. My solution looks like this:

    with
    cte1 as(
        select
            *,
            first_value("session") over(partition by event_date, "user" order by event_date) as day_first_session,
            first_value(event_date) over(partition by "user" order by event_date) as first_session_timestamp,
            case 
                when page_type = 'product_page' and event_type = 'page_view' then 1 else 0
            end as is_product_view
        from sample_events se
    ),
    productview_only as(
        select
            event_date,
            count(distinct "user") as viewed_products_only
        from (
            select
                *,
                min(is_product_view) over(partition by day_first_session) as min_indicator,
                max(is_product_view) over(partition by day_first_session) as max_indicator
            from cte1
        )t
        where min_indicator = max_indicator and max_indicator = 1
        group by 1
    ),
    order_within_2d as(
        select 
            event_date,
            count(distinct "user") as placed_order_within_2d
        from cte1
        where event_date - first_session_timestamp <= 2
        and event_type = 'order'
        group by 1  
    ),
    added_one_product as(
        select 
            event_date,
            count(distinct "user") as added_one_product_only
        from(
            select 
                *,
                count(event_type) over(partition by event_date, "user") as adds_to_cart
            from sample_events se 
            where event_type = 'add_to_cart'
        )t
        where adds_to_cart = 1
        group by 1
    )
    select 
        po.event_date,
        po.viewed_products_only,
        aop.added_one_product_only,
        ow2d.placed_order_within_2d
    from productview_only po
    join added_one_product aop
        using(event_date)
    join order_within_2d ow2d
        using(event_date)
    

  2. If I’ve understood the requirements correctly, then the following query returns the expected results:

    WITH date_first_sessions AS (
        SELECT s.USER,
               s.event_date,
               min(s.session) AS session
          FROM sample_events s
         GROUP BY s.user,
                  s.event_date),
    first_sessions AS (
        SELECT s.user,
               min(s.event_date) as event_date
          FROM sample_events s
         GROUP BY s.user),
    met_criteria AS (
        SELECT s.event_date,
               s.USER,
               min(CASE
                     WHEN s.page_type = 'product_page' AND s.event_type = 'page_view' THEN 1
                     ELSE 0
                   END) FILTER (WHERE s.session = dfs.session) = 1 AS only_product_viewed,
               sum(CASE
                     WHEN s.event_type = 'add_to_cart' THEN 1
                     ELSE 0
                   END) = 1 AS added_only_one,
               sum(CASE
                     WHEN s.event_type = 'order' THEN 1
                     ELSE 0
                   END) FILTER (WHERE fs.event_date BETWEEN s.event_date - 2 AND s.event_date) > 0 AS ordered_within_2_days
          FROM sample_events s
               JOIN date_first_sessions dfs
                 ON (s.user = dfs.user AND s.event_date = dfs.event_date)
               JOIN first_sessions fs
                 ON (s.user = fs.user)
         GROUP BY s.event_date,
                  s.USER)
    SELECT m.event_date,
        sum(CASE
              WHEN m.only_product_viewed THEN 1
              ELSE 0
            END) AS num_only_product_viewed,
        sum(CASE
              WHEN m.added_only_one THEN 1
              ELSE 0
            END) AS num_added_only_one,
        sum(CASE
              WHEN m.ordered_within_2_days THEN 1
              ELSE 0
            END) AS num_ordered_within_2_days
      FROM met_criteria m
     GROUP BY m.event_date
     ORDER BY m.event_date;
    

    So how does this query work? The first CTE, date_first_sessions, determines first sessions by user and date. The second CTE, first_sessions, gets the date of each users first session. The third CTE, met_criteria, determines for each user and date which of the conditions of interest have been satisfied. It uses the aggregated first day’s sessions and first session dates from date_first_sessions and first_sessions. The final SELECT counts the number of users that satisfied each criteria on each date.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search