I’ve got a dataset (in postgres specifically) that I need to calculate the ‘user session time’ for. I’m fairly positive I need to use partitioning to answer the question, but I’m not really sure how to go about it. The events
table as an id, a user_id, the event category, and a date. When a user signs in, we put an event in the table for that user. Each time they visit a page, there’s a similar event. The table looks similar to this:
id | user_id | category | date |
---|---|---|---|
1 | 100 | Sign-In | 2023-12-20 01:07:03 |
2 | 100 | Page View | 2023-12-20 01:07:05 |
3 | 100 | Page View | 2023-12-20 01:09:17 |
4 | 107 | Sign-In | 2023-12-20 01:09:19 |
5 | 107 | Page View | 2023-12-20 01:09:21 |
6 | 107 | Page View | 2023-12-20 01:09:50 |
7 | 100 | Page View | 2023-12-20 01:11:28 |
.. | .. | .. | .. |
784 | 100 | Sign-In | 2023-12-21 17:37:31 |
To calculate the session time from user 100, let’s assume that event 7 is their last page view in the system for that session. We need to get the Sign-In event (id 1
), find the last event before the Sign-In (id 7
), and subtract the two times. That’d be 2023-12-20 01:11:28 - 2023-12-20 01:07:03
, or 4 minutes and 25 seconds
. I’d want to do this for each user/sign-in combination in the dataset. For completeness, it’s possible that the user doesn’t have any Page Views, in which case their session time is zero. The results should look something like this:
user_id | session_start | duration |
---|---|---|
100 | 2023-12-20 01:07:03 | 00:04:25 |
100 | 2023-12-21 17:37:31 | null (or 00:00:00) |
107 | 2023-12-20 01:09:19 | 00:00:31 |
My guess is that I want to partition by user, and then by the event category, but I need the row before the one selected potentially, and not the next window. In code, I’d normally just take all the events, order them by user then by duration, and start to build a new result set by just iterating through and creation a ‘session’ each time I have a sign-in event. I’m just curious if this could be done in SQL on the database more efficiently.
Thanks for any help or pointers!
2
Answers
if you assemble the queries in a CTE something like the following should work :-
You can trade the window function for a clunky scalar subquery if you find it more familiar or intuitive. After all, from each
Sign-In
row, you’re just looking ahead for maximum date for the same user, that’s before the nextSign-In
: demo