skip to Main Content

I’ve got a dataset (in postgres specifically) that I need to calculate the ‘user session time’ for. I’m fairly positive I need to use partitioning to answer the question, but I’m not really sure how to go about it. The events table as an id, a user_id, the event category, and a date. When a user signs in, we put an event in the table for that user. Each time they visit a page, there’s a similar event. The table looks similar to this:

id user_id category date
1 100 Sign-In 2023-12-20 01:07:03
2 100 Page View 2023-12-20 01:07:05
3 100 Page View 2023-12-20 01:09:17
4 107 Sign-In 2023-12-20 01:09:19
5 107 Page View 2023-12-20 01:09:21
6 107 Page View 2023-12-20 01:09:50
7 100 Page View 2023-12-20 01:11:28
.. .. .. ..
784 100 Sign-In 2023-12-21 17:37:31

To calculate the session time from user 100, let’s assume that event 7 is their last page view in the system for that session. We need to get the Sign-In event (id 1), find the last event before the Sign-In (id 7), and subtract the two times. That’d be 2023-12-20 01:11:28 - 2023-12-20 01:07:03, or 4 minutes and 25 seconds. I’d want to do this for each user/sign-in combination in the dataset. For completeness, it’s possible that the user doesn’t have any Page Views, in which case their session time is zero. The results should look something like this:

user_id session_start duration
100 2023-12-20 01:07:03 00:04:25
100 2023-12-21 17:37:31 null (or 00:00:00)
107 2023-12-20 01:09:19 00:00:31

My guess is that I want to partition by user, and then by the event category, but I need the row before the one selected potentially, and not the next window. In code, I’d normally just take all the events, order them by user then by duration, and start to build a new result set by just iterating through and creation a ‘session’ each time I have a sign-in event. I’m just curious if this could be done in SQL on the database more efficiently.

Thanks for any help or pointers!

2

Answers


  1. if you assemble the queries in a CTE something like the following should work :-

    with sign_in_table as ( 
      select user_id, date
      from events 
      where category = 'Sign-in' ),
    
    last_activity_table as ( 
      select user_id , max(date) as date
      from events
      where category <> 'Sign-in'
      group by user) 
    
    select user_id, 
           sign_in_table.date as session_start, 
           last_activity_table.date - sign_in_table.date as duration
    from sign_in_table
    left join last_activity_table on (user_id)
    
    Login or Signup to reply.
  2. You can trade the window function for a clunky scalar subquery if you find it more familiar or intuitive. After all, from each Sign-In row, you’re just looking ahead for maximum date for the same user, that’s before the next Sign-In: demo

    select user_id,
           date as sign_in_time,
           (select max(date) from test t2 
            where t2.user_id=t1.user_id
            and t2.date>=t1.date
            and category<>'Sign-In'
            and date<=(
                select coalesce(min(t3.date),'infinity'::timestamp) 
                from test t3 where category='Sign-In'
                and t3.date>t1.date and t3.user_id=t1.user_id)
           )-date as session_duration
    from test t1
    where category='Sign-In'
    order by 1,2;
    
    user_id sign_in_time session_duration
    100 2023-12-20 01:07:03+00 00:04:25
    100 2023-12-21 17:37:31+00 null
    107 2023-12-20 01:09:19+00 00:00:31
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search