skip to Main Content

I need to track the sequence of sites visited by a user in their session. My data looks like this:

visitor session_num site page_view_num timestamp
User A 2 Site A 1 2024-03-22 11:41:10
User A 2 Site A 2 2024-03-22 11:41:14
User A 2 Site A 3 2024-03-22 11:41:16
User A 2 Site B 4 2024-03-22 11:41:47
User A 2 Site B 5 2024-03-22 11:42:18
User A 2 Site B 6 2024-03-22 11:42:19
User A 2 Site C 7 2024-03-22 11:42:55
User A 2 Site C 8 2024-03-22 11:43:18
User A 2 Site C 9 2024-03-22 11:43:31
User A 2 Site A 10 2024-03-22 11:52:19
User A 2 Site A 11 2024-03-22 10:52:21
User B 4 Site B 1 2024-03-25 10:52:30
User B 4 Site B 2 2024-03-25 10:52:40
User B 4 Site A 3 2024-03-25 10:53:50
User B 4 Site B 4 2024-03-25 10:54:19

I am trying to achieve something like this (**site_num **column)

visitor session_num site page_view_num site_num timestamp
User A 2 Site A 1 1 2024-03-22 11:41:10
User A 2 Site A 2 1 2024-03-22 11:41:14
User A 2 Site A 3 1 2024-03-22 11:41:16
User A 2 Site B 4 2 2024-03-22 11:41:47
User A 2 Site B 5 2 2024-03-22 11:42:18
User A 2 Site B 6 2 2024-03-22 11:42:19
User A 2 Site C 7 3 2024-03-22 11:42:55
User A 2 Site C 8 3 2024-03-22 11:43:18
User A 2 Site C 9 3 2024-03-22 11:43:31
User A 2 Site A 10 4 2024-03-22 11:52:19
User A 2 Site A 11 4 2024-03-22 10:52:21
User B 4 Site B 1 1 2024-03-25 10:52:30
User B 4 Site B 2 1 2024-03-25 10:52:40
User B 4 Site A 3 2 2024-03-25 10:53:50
User B 4 Site B 4 3 2024-03-25 10:54:19

Essentially visits to each site should be grouped sequentially (partitioned by visitor, session_num) – I’ve tried various window functions but I’m struggling with cases when a user goes back to a site they have already visited early in the session, which needs to be counted as a separate instance

e.g. for User A the sequence is: Site A (site_num=1) > Site B (site_num=2) > Site C (site_num=3) > Site A (site_num=4)

2

Answers


  1. This is a Gaps & Island problem. You can use the traditional solution:

    select *, sum(i) over(partition by visitor order by page_view_num) as site_num
    from (
      select *, 
        case when site = lag(site) over(partition by visitor order by page_view_num)
             then 0 else 1 end as i
      from t
    ) x
    

    Result:

     visitor  session_num  site    page_view_num  timestamp            i  site_num 
     -------- ------------ ------- -------------- -------------------- -- -------- 
     User A   2            Site A  1              2024-03-22 11:41:10  1  1        
     User A   2            Site A  2              2024-03-22 11:41:14  0  1        
     User A   2            Site A  3              2024-03-22 11:41:16  0  1        
     User A   2            Site B  4              2024-03-22 11:41:47  1  2        
     User A   2            Site B  5              2024-03-22 11:42:18  0  2        
     User A   2            Site B  6              2024-03-22 11:42:19  0  2        
     User A   2            Site C  7              2024-03-22 11:42:55  1  3        
     User A   2            Site C  8              2024-03-22 11:43:18  0  3        
     User A   2            Site C  9              2024-03-22 11:43:31  0  3        
     User A   2            Site A  10             2024-03-22 11:52:19  1  4        
     User A   2            Site A  11             2024-03-22 10:52:21  0  4        
     User B   4            Site B  1              2024-03-25 10:52:30  1  1        
     User B   4            Site B  2              2024-03-25 10:52:40  0  1        
     User B   4            Site A  3              2024-03-25 10:53:50  1  2        
     User B   4            Site B  4              2024-03-25 10:54:19  1  3        
    

    See running example at db<>fiddle.

    Login or Signup to reply.
  2. In PostgreSQL, see the solution by @The Impaler.

    In Vertica, you should use the CONDITIONAL_TRUE_EVENT() OLAP function, which is Vertica’s means to "sessionize" your time series data. (where "sessionizing" and "solving the gaps-and-islands question" mean the same).

    Without that function, you need to nest two OLAP queries into each other. The inner has a counter that is at 0 when the visitor is the same as before, and at 1 when the visitor changes. The outer selects from the inner, creating the running sum of the counter obtained previously.

    I get exactly your result if I change the timestamp 2024-03-22 10:52:21 to 2024-03-22 11:52:21 – which would be consistent with the rules you state in the description.

    And I’m using the named window , which you could also use in PostgreSQL, and add the CONDITIONAL_CHANGE_EVENT() function, which you could use alternatively.

    WITH
    -- your input
    indata(visitor,session_num,site,page_view_num,timestamp) AS (
              SELECT 'User A',2,'Site A', 1,TIMESTAMP '2024-03-22 11:41:10'
    UNION ALL SELECT 'User A',2,'Site A', 2,TIMESTAMP '2024-03-22 11:41:14'
    UNION ALL SELECT 'User A',2,'Site A', 3,TIMESTAMP '2024-03-22 11:41:16'
    UNION ALL SELECT 'User A',2,'Site B', 4,TIMESTAMP '2024-03-22 11:41:47'
    UNION ALL SELECT 'User A',2,'Site B', 5,TIMESTAMP '2024-03-22 11:42:18'
    UNION ALL SELECT 'User A',2,'Site B', 6,TIMESTAMP '2024-03-22 11:42:19'
    UNION ALL SELECT 'User A',2,'Site C', 7,TIMESTAMP '2024-03-22 11:42:55'
    UNION ALL SELECT 'User A',2,'Site C', 8,TIMESTAMP '2024-03-22 11:43:18'
    UNION ALL SELECT 'User A',2,'Site C', 9,TIMESTAMP '2024-03-22 11:43:31'
    UNION ALL SELECT 'User A',2,'Site A',10,TIMESTAMP '2024-03-22 11:52:19'
    UNION ALL SELECT 'User A',2,'Site A',11,TIMESTAMP '2024-03-22 11:52:21'
    UNION ALL SELECT 'User B',4,'Site B', 1,TIMESTAMP '2024-03-25 10:52:30'
    UNION ALL SELECT 'User B',4,'Site B', 2,TIMESTAMP '2024-03-25 10:52:40'
    UNION ALL SELECT 'User B',4,'Site A', 3,TIMESTAMP '2024-03-25 10:53:50'
    UNION ALL SELECT 'User B',4,'Site B', 4,TIMESTAMP '2024-03-25 10:54:19'
    )
    -- real query starts here, replace following comma with "WITH"
    SELECT
      visitor
    , session_num
    , site
    , page_view_num
    , CONDITIONAL_TRUE_EVENT(LAG(site,1,'site?')<> site) OVER w AS site_num
    , CONDITIONAL_CHANGE_EVENT(site) OVER w + 1 AS site_num_chg
    , timestamp
    FROM indata
    WINDOW w AS (PARTITION BY visitor ORDER BY timestamp) 
    ORDER BY visitor,timestamp;
    
    visitor session_num site page_view_num site_num site_num_chg timestamp
    User A 2 Site A 1 1 1 2024-03-22 11:41:10
    User A 2 Site A 2 1 1 2024-03-22 11:41:14
    User A 2 Site A 3 1 1 2024-03-22 11:41:16
    User A 2 Site B 4 2 2 2024-03-22 11:41:47
    User A 2 Site B 5 2 2 2024-03-22 11:42:18
    User A 2 Site B 6 2 2 2024-03-22 11:42:19
    User A 2 Site C 7 3 3 2024-03-22 11:42:55
    User A 2 Site C 8 3 3 2024-03-22 11:43:18
    User A 2 Site C 9 3 3 2024-03-22 11:43:31
    User A 2 Site A 10 4 4 2024-03-22 11:52:19
    User A 2 Site A 11 4 4 2024-03-22 11:52:21
    User B 4 Site B 1 1 1 2024-03-25 10:52:30
    User B 4 Site B 2 1 1 2024-03-25 10:52:40
    User B 4 Site A 3 2 2 2024-03-25 10:53:50
    User B 4 Site B 4 3 3 2024-03-25 10:54:19
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search