
I need some help. I have a PostgreSQL table with logs of user sessions.
The table has three fields: id, action, and action_date (a timestamp).
The action field has two possible values: "page_load" or "page_exit".
The action_date field holds the timestamp of each action, so a session's start and end times come from a matching "page_load"/"page_exit" pair.
I need to calculate the average session length for each user who visited the website.
The problem is that, probably due to errors in data recording, some users (id) have unequal numbers of "page_load" and "page_exit" rows. In addition, the same user can have multiple sessions within one day.
There are also cases where two consecutive "page_load" actions occur a short time apart, followed by a "page_exit" action that marks the end of only the second session:

8658    page_load   2023-02-01 21:58:44.000
8658    page_load   2023-02-02 08:31:32.000
8658    page_exit   2017-01-28 08:42:18.000

Because of the varying number of "page_load" and "page_exit" actions per user, I'm having trouble performing the calculation.
I've tried creating separate datasets by splitting on the action type and then joining them, but the result is wrong because of sessions with mismatched counts of "page_load" and "page_exit" actions.
I also attempted to use the lead() function, but I get incorrect calculations in cases where "page_load" or "page_exit" actions occur consecutively.
I've tried filtering out the rows where the counts of "page_load" and "page_exit" actions don't match, but I end up with shifted data and incorrect calculations as well.
Is there a way to discard these extra rows and perform the calculation only on matching pairs? I've run out of ideas and hope the community can help.
Here is one of my attempts to eliminate rows where "page_load" and "page_exit" don't match, so that I can obtain two separate columns and later use them to calculate the average length. However, this query is incorrect: it produces large gaps between "page_load" and "page_exit" times that don't exist in the original data.

WITH ses_events AS (
    SELECT id,
           action,
           action_date,
           ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq
      FROM users_act
),
arr AS (
    SELECT id,
           array_agg(CASE WHEN action = 'page_load' THEN action_date END) AS ses_start,
           array_agg(CASE WHEN action = 'page_exit' THEN action_date END) AS ses_end
      FROM ses_events
  GROUP BY id, event_seq
    HAVING COUNT(*) FILTER (WHERE action IN ('page_load', 'page_exit')) % 2 = 0
),
dr_arr AS (
    SELECT id,
           unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,
           unnest(array_remove(ses_end, NULL))::timestamp AS ses_end
      FROM arr
)
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;

Another attempt calculates the length using window offsets, but it also returns incorrect data because of rows that lack a matching "page_load" or "page_exit":

WITH ld AS (
    SELECT id,
           action,
           action_date,
           LEAD(action_date) OVER (PARTITION BY id ORDER BY action_date) AS end_s
      FROM users_act
),
l AS (
    SELECT id,
           (end_s - action_date) AS length_ses
      FROM ld
     WHERE action = 'page_load'
)
SELECT id,
       avg(length_ses) AS avg_us_ses
  FROM l
 GROUP BY id;

I have tried several different approaches, but none have produced the correct result. I believe the key is to pivot the rows into two separate fields, start_ses (timestamp) and end_ses (timestamp), but my attempts to write such a query haven't worked so far. I hope someone can offer guidance or point me towards the right solution.

2 Answers


  1. You don't say exactly how you want to handle "outlier" rows. A simple approach is to use lead() or lag() to identify "legitimate" rows, i.e. a page load immediately followed by a page exit. We then ignore all other rows and compute the average over those pairs.

    select id, avg(action_date - lag_action_date) avg_session_duration
    from (
        select e.*,
            lag(action)      over(partition by id order by action_date) lag_action,
            lag(action_date) over(partition by id order by action_date) lag_action_date
        from users_act e
    ) e
    where action = 'page_exit' and lag_action = 'page_load'
    group by id
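    The same pairing logic can be exercised outside PostgreSQL. Below is a minimal, self-contained sketch using Python's stdlib sqlite3 module (SQLite ≥ 3.25 supports lag()); the table and column names follow the question, but the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_act (id INTEGER, action TEXT, action_date TEXT);
    INSERT INTO users_act VALUES
        (8658, 'page_exit', '2017-01-28 08:42:18'),  -- orphan exit, ignored
        (8658, 'page_load', '2023-02-01 21:58:44'),
        (8658, 'page_exit', '2023-02-01 22:58:44'),  -- 1 h session
        (8658, 'page_load', '2023-02-02 08:31:32'),
        (8658, 'page_exit', '2023-02-02 08:41:32');  -- 10 min session
""")

# Keep only exits whose immediately preceding event is a load,
# then average the (exit - load) gaps per user, in seconds.
rows = conn.execute("""
    SELECT id,
           AVG((julianday(action_date) - julianday(lag_date)) * 86400) AS avg_sec
    FROM (
        SELECT id, action, action_date,
               LAG(action)      OVER (PARTITION BY id ORDER BY action_date) AS lag_action,
               LAG(action_date) OVER (PARTITION BY id ORDER BY action_date) AS lag_date
        FROM users_act
    )
    WHERE action = 'page_exit' AND lag_action = 'page_load'
    GROUP BY id
""").fetchall()

print(rows)  # one row per user; the 3600 s and 600 s sessions average to ~2100 s
```

    The orphan 2017 exit is dropped automatically because its lag_action is NULL, which is the same behaviour the PostgreSQL query above relies on.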
    
  2. In such cases you typically want to introduce a cleaning view that still contains the detailed (non-aggregated) data, but adds information that helps classify it.

    Here is an example that classifies a terminating record as OK if the previous record is an opening record, and calculates the session duration:

    with session_view as (
        select
            id, "action", action_date,
            lag(action_date) over (partition by id order by action_date) as action_date_lag,
            lag("action") over (partition by id order by action_date) as action_lag
        from dt
    )
    select
        id, "action", action_date,
        case
            when "action" = 'page_load' then null /* session start */
            when "action" = 'page_exit' and action_lag = 'page_load' then 'OK'
            else 'illegal session'
        end as session_status,
        case when "action" = 'page_exit' then action_date - action_date_lag end as session_duration
    from session_view
    order by id, action_date
    
    id  |action   |action_date        |session_status |session_duration|
    ----+---------+-------------------+---------------+----------------+
    8658|page_exit|2017-01-28 08:42:18|illegal session|                |
    8658|page_load|2023-02-01 21:58:44|               |                |
    8658|page_exit|2023-02-01 22:58:44|OK             |        01:00:00|
    8658|page_load|2023-02-02 08:31:32|               |                |
    8658|page_exit|2023-02-02 08:41:32|OK             |        00:10:00|
    

    You then consider only the sessions with the status OK in the average calculation.

    Additionally, you may check the duration and discard outliers using some threshold for sessions that are too long or too short.

    Alternatively, use the median, which is less sensitive to outliers.

    You may want to use EXTRACT(EPOCH FROM ...) to calculate the duration in seconds or other units.
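    For instance, a sketch of the final aggregation, assuming the classification query above is wrapped in a CTE named classified that yields id, session_status, and session_duration (names are illustrative):

```sql
-- Average in seconds via EXTRACT(EPOCH FROM ...), plus a median
-- (percentile_cont supports interval values in PostgreSQL).
SELECT id,
       AVG(EXTRACT(EPOCH FROM session_duration))                          AS avg_seconds,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY session_duration)      AS median_duration
  FROM classified
 WHERE session_status = 'OK'
 GROUP BY id;
```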
