I have a dataset with ecommerce events:
create table sample_events(
    event_date date,
    "session" varchar,
    "user" varchar,
    page_type varchar,
    event_type varchar,
    product int8);
INSERT INTO sample_events (event_date,"session","user",page_type,event_type,product) values
('2022-10-01','session1user1','user1','product_page','page_view',0),
('2022-10-01','session1user2','user2','listing_page','page_view',0),
('2022-10-01','session1user2','user2','search_listing_page','page_view',0),
('2022-10-01','session1user3','user3','product_page','page_view',0),
('2022-10-01','session2user1','user1','product_page','add_to_cart',20969597),
('2022-10-02','session2user1','user1','order_page','order',0),
('2022-10-02','session2user3','user3','product_page','add_to_cart', 34856927),
('2022-10-02','session3user3','user3','product_page','add_to_cart', 19848603),
('2022-10-04','session4user3','user3','order_page','order',0);
I need a single query that returns, for each day, the number of customers who:
- only viewed products in their first session;
- added only one product to the cart;
- placed an order within two days after the first session.
For the first requirement I tried:
with
cte1 as(
    select
        *,
        first_value("session") over(partition by event_date, "user" order by event_date) as day_first_session,
        case
            when page_type = 'product_page' and event_type = 'page_view' then 1 else 0
        end as is_product_view
    from sample_events se
)
select
    event_date,
    count(distinct "user")
from (
    select
        *,
        min(is_product_view) over(partition by day_first_session) as min_indicator,
        max(is_product_view) over(partition by day_first_session) as max_indicator
    from cte1)t
where min_indicator=max_indicator and max_indicator = 1
group by 1
It works, but I don’t think it is the most efficient approach. With it, all three requirements can only be returned together by joining the results by date, and I would like to avoid joins in the query. Is there any way to compose such a query?
2 Answers
Well, it seems there is no way other than joining the tables by date. For now, though, I guess it wouldn't be slow or expensive, because even over ten years they would have only around 10K rows. My solution looks like this:
If I’ve understood the requirements correctly, then the following query returns the expected results:
So how does this query work? The first CTE, date_first_sessions, determines the first sessions by user and date. The second CTE, first_sessions, gets the date of each user’s first session. The third CTE, met_criteria, determines for each user and date which of the conditions of interest have been satisfied. It uses the aggregated first day’s sessions and first session dates from date_first_sessions and first_sessions. The final SELECT counts the number of users that satisfied each criterion on each date.
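Since the query text is not shown here, the following is only a minimal sketch of how the structure described above might look in PostgreSQL; it is not the original query. It reuses the CTE names from the explanation, but everything else is an assumption: that session ids sort chronologically within a user (so min("session") picks the first session of a day), that "added only one product" means exactly one distinct product was added to the cart on that date, and that "within two days" is measured from the date of the user's first-ever session. The output column names are made up for illustration.

```sql
with date_first_sessions as (
    -- first session per user and day
    -- (assumption: session ids sort chronologically)
    select "user", event_date, min("session") as first_session
    from sample_events
    group by "user", event_date
),
first_sessions as (
    -- date of each user's first-ever session
    select "user", min(event_date) as first_session_date
    from sample_events
    group by "user"
),
met_criteria as (
    select
        se.event_date,
        se."user",
        -- true when every event in the day's first session is a product page view
        bool_and(se.page_type = 'product_page' and se.event_type = 'page_view')
            filter (where se."session" = dfs.first_session) as only_viewed_products,
        -- true when exactly one distinct product was added to the cart that day
        count(distinct se.product)
            filter (where se.event_type = 'add_to_cart') = 1 as added_one_product,
        -- true when an order was placed at most two days after the first session date
        bool_or(se.event_type = 'order'
                and se.event_date <= fs.first_session_date + 2) as ordered_within_two_days
    from sample_events se
    join date_first_sessions dfs
      on dfs."user" = se."user" and dfs.event_date = se.event_date
    join first_sessions fs
      on fs."user" = se."user"
    group by se.event_date, se."user"
)
select
    event_date,
    count(*) filter (where only_viewed_products)    as users_only_viewed_products,
    count(*) filter (where added_one_product)       as users_added_one_product,
    count(*) filter (where ordered_within_two_days) as users_ordered_within_two_days
from met_criteria
group by event_date
order by event_date;
```

The joins here are on small aggregated CTEs, which matches the description above. If avoiding joins matters more than matching that structure, the same per-row values could probably be obtained with min("session") over (partition by "user", event_date) and min(event_date) over (partition by "user") instead.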