I need to track the sequence of sites visited by a user in their session. My data looks like this:
visitor | session_num | site | page_view_num | timestamp |
---|---|---|---|---|
User A | 2 | Site A | 1 | 2024-03-22 11:41:10 |
User A | 2 | Site A | 2 | 2024-03-22 11:41:14 |
User A | 2 | Site A | 3 | 2024-03-22 11:41:16 |
User A | 2 | Site B | 4 | 2024-03-22 11:41:47 |
User A | 2 | Site B | 5 | 2024-03-22 11:42:18 |
User A | 2 | Site B | 6 | 2024-03-22 11:42:19 |
User A | 2 | Site C | 7 | 2024-03-22 11:42:55 |
User A | 2 | Site C | 8 | 2024-03-22 11:43:18 |
User A | 2 | Site C | 9 | 2024-03-22 11:43:31 |
User A | 2 | Site A | 10 | 2024-03-22 11:52:19 |
User A | 2 | Site A | 11 | 2024-03-22 10:52:21 |
User B | 4 | Site B | 1 | 2024-03-25 10:52:30 |
User B | 4 | Site B | 2 | 2024-03-25 10:52:40 |
User B | 4 | Site A | 3 | 2024-03-25 10:53:50 |
User B | 4 | Site B | 4 | 2024-03-25 10:54:19 |
I am trying to achieve something like this (**site_num **column)
visitor | session_num | site | page_view_num | site_num | timestamp |
---|---|---|---|---|---|
User A | 2 | Site A | 1 | 1 | 2024-03-22 11:41:10 |
User A | 2 | Site A | 2 | 1 | 2024-03-22 11:41:14 |
User A | 2 | Site A | 3 | 1 | 2024-03-22 11:41:16 |
User A | 2 | Site B | 4 | 2 | 2024-03-22 11:41:47 |
User A | 2 | Site B | 5 | 2 | 2024-03-22 11:42:18 |
User A | 2 | Site B | 6 | 2 | 2024-03-22 11:42:19 |
User A | 2 | Site C | 7 | 3 | 2024-03-22 11:42:55 |
User A | 2 | Site C | 8 | 3 | 2024-03-22 11:43:18 |
User A | 2 | Site C | 9 | 3 | 2024-03-22 11:43:31 |
User A | 2 | Site A | 10 | 4 | 2024-03-22 11:52:19 |
User A | 2 | Site A | 11 | 4 | 2024-03-22 10:52:21 |
User B | 4 | Site B | 1 | 1 | 2024-03-25 10:52:30 |
User B | 4 | Site B | 2 | 1 | 2024-03-25 10:52:40 |
User B | 4 | Site A | 3 | 2 | 2024-03-25 10:53:50 |
User B | 4 | Site B | 4 | 3 | 2024-03-25 10:54:19 |
Essentially visits to each site should be grouped sequentially (partitioned by visitor, session_num) – I’ve tried various window functions but I’m struggling with cases when a user goes back to a site they have already visited early in the session, which needs to be counted as a separate instance
e.g. for User A the sequence is: Site A (site_num=1) > Site B (site_num=2) > Site C (site_num=3) > Site A (site_num=4)
2
Answers
This is a Gaps & Island problem. You can use the traditional solution:
Result:
See running example at db<>fiddle.
In PostgreSQL, see the solution by
@The Impaler
.In Vertica, you should use the
CONDITIONAL_TRUE_EVENT()
OLAP function, which is Vertica’s means to "sessionize" your time series data. (where "sessionizing" and "solving the gaps-and-islands question" mean the same).Without that function, you need to nest two OLAP queries into each other. The inner has a counter that is at 0 when the visitor is the same as before, and at 1 when the visitor changes. The outer selects from the inner, creating the running sum of the counter obtained previously.
I get exactly your result if I change the timestamp
2024-03-22 10:52:21
to2024-03-22 11:52:21
– which would be consistent with the rules you state in the description.And I’m using the named window , which you could also use in PostgreSQL, and add the
CONDITIONAL_CHANGE_EVENT()
function, which you could use alternatively.