
I need some help. I have a PostgreSQL table with logs of user sessions.
The table has three fields: id, action, and action_date (a timestamp).
The action field has two possible values: "page_load" or "page_exit".
The action_date field holds the timestamp of each action, so a session's start and end times come from a matching "page_load"/"page_exit" pair.
I need to calculate the average session length for each user who visited the website.
The problem is that, probably due to errors in data recording, some users (id) have unequal numbers of "page_load" and "page_exit" rows. In addition, the same user can have multiple sessions within one day.
There are also cases where two consecutive "page_load" actions occur a short time apart, followed by a "page_exit" action that marks the end of only the second session:

8658    page_load   2023-02-01 21:58:44.000
8658    page_load   2023-02-02 08:31:32.000
8658    page_exit   2017-01-28 08:42:18.000

Because of the varying number of "page_load" and "page_exit" actions per user, I'm having trouble performing the calculation.
I've tried creating separate datasets by splitting on the action type and then joining them, but the result is wrong because of sessions with mismatched counts of "page_load" and "page_exit" actions.
I also attempted to use the lead() function, but I get incorrect calculations in cases where "page_load" or "page_exit" actions occur consecutively.
I've tried filtering out the rows where the counts of "page_load" and "page_exit" actions don't match, but I end up with shifted data and incorrect calculations as well.
Is there a way to discard these extra rows and perform the calculation only on matching pairs? I've run out of ideas and hope the community can help.
Here is one of my attempts to eliminate rows where "page_load" and "page_exit" don't match, so that I can obtain two separate columns and later use them to calculate the average length. However, this query is incorrect: it produces large gaps between "page_load" and "page_exit" times that don't exist in the original data.

WITH ses_events AS (
    SELECT id,
           action,
           action_date,
           ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq
      FROM users_act
),
arr AS (
    SELECT id,
           array_agg(CASE WHEN action = 'page_load' THEN action_date END) AS ses_start,
           array_agg(CASE WHEN action = 'page_exit' THEN action_date END) AS ses_end
      FROM ses_events
  GROUP BY id, event_seq
    HAVING COUNT(*) FILTER (WHERE action IN ('page_load', 'page_exit')) % 2 = 0
),
dr_arr AS (
    SELECT id,
           unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,
           unnest(array_remove(ses_end, NULL))::timestamp AS ses_end
      FROM arr
)
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;

Another attempt calculates the length using window offsets, but it also returns incorrect data because of rows that lack a matching "page_load" or "page_exit":

WITH ld AS (
    SELECT id,
           action,
           action_date,
           LEAD(action_date) OVER (PARTITION BY id ORDER BY action_date) AS end_s
      FROM users_act
),
l AS (
    SELECT id,
           (end_s - action_date) AS length_ses
      FROM ld
     WHERE action = 'page_load'
)
SELECT id,
       avg(length_ses) AS avg_us_ses
  FROM l
 GROUP BY id;

I have tried several different approaches, but none have produced the correct result. I believe the key is to pivot the rows into two separate fields, start_ses (timestamp) and end_ses (timestamp), but my attempts to write such a query haven't worked so far. I hope someone can offer guidance or point me towards the right solution.

2 Answers


  1. You don't say exactly how you want to handle "outlier" rows. A simple approach is to use lead() or lag() to identify "legitimate" rows, i.e. a page load immediately followed by a page exit. We then ignore all other rows and compute the average over those pairs.

    select id, avg(action_date - lag_action_date) avg_session_duration
    from (
        select e.*,
            lag(action)      over(partition by id order by action_date) lag_action,
            lag(action_date) over(partition by id order by action_date) lag_action_date
        from users_act e
    ) e
    where action = 'page_exit' and lag_action = 'page_load'
    group by id
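    The same pairing logic can be exercised outside PostgreSQL. Below is a minimal, self-contained sketch using Python's stdlib sqlite3 module (SQLite ≥ 3.25 supports lag()); the table and column names follow the question, but the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_act (id INTEGER, action TEXT, action_date TEXT);
    INSERT INTO users_act VALUES
        (8658, 'page_exit', '2017-01-28 08:42:18'),  -- orphan exit, ignored
        (8658, 'page_load', '2023-02-01 21:58:44'),
        (8658, 'page_exit', '2023-02-01 22:58:44'),  -- 1 h session
        (8658, 'page_load', '2023-02-02 08:31:32'),
        (8658, 'page_exit', '2023-02-02 08:41:32');  -- 10 min session
""")

# Keep only exits whose immediately preceding event is a load,
# then average the (exit - load) gaps per user, in seconds.
rows = conn.execute("""
    SELECT id,
           AVG((julianday(action_date) - julianday(lag_date)) * 86400) AS avg_sec
    FROM (
        SELECT id, action, action_date,
               LAG(action)      OVER (PARTITION BY id ORDER BY action_date) AS lag_action,
               LAG(action_date) OVER (PARTITION BY id ORDER BY action_date) AS lag_date
        FROM users_act
    )
    WHERE action = 'page_exit' AND lag_action = 'page_load'
    GROUP BY id
""").fetchall()

print(rows)  # one row per user; the 3600 s and 600 s sessions average to ~2100 s
```

    The orphan 2017 exit is dropped automatically because its lag_action is NULL, which is the same behaviour the PostgreSQL query above relies on.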
    
  2. In such cases you typically want to introduce a cleaning view that still contains the detailed (non-aggregated) data, but adds information that helps classify it.

    Here is an example that classifies a terminating record as OK if the previous record is an opening record, and calculates the session duration:

    with session_view as (
        select
            id, "action", action_date,
            lag(action_date) over (partition by id order by action_date) as action_date_lag,
            lag("action") over (partition by id order by action_date) as action_lag
        from dt
    )
    select
        id, "action", action_date,
        case
            when "action" = 'page_load' then null /* session start */
            when "action" = 'page_exit' and action_lag = 'page_load' then 'OK'
            else 'illegal session'
        end as session_status,
        case when "action" = 'page_exit' then action_date - action_date_lag end as session_duration
    from session_view
    order by id, action_date
    
    id  |action   |action_date        |session_status |session_duration|
    ----+---------+-------------------+---------------+----------------+
    8658|page_exit|2017-01-28 08:42:18|illegal session|                |
    8658|page_load|2023-02-01 21:58:44|               |                |
    8658|page_exit|2023-02-01 22:58:44|OK             |        01:00:00|
    8658|page_load|2023-02-02 08:31:32|               |                |
    8658|page_exit|2023-02-02 08:41:32|OK             |        00:10:00|
    

    You then consider only the sessions with the status OK in the average calculation.

    Additionally, you may check the duration and discard outliers using some threshold for sessions that are too long or too short.

    Alternatively, use the median, which is less sensitive to outliers.

    You may want to use EXTRACT(EPOCH FROM ...) to calculate the duration in seconds or other units.
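    For instance, a sketch of the final aggregation, assuming the classification query above is wrapped in a CTE named classified that yields id, session_status, and session_duration (names are illustrative):

```sql
-- Average in seconds via EXTRACT(EPOCH FROM ...), plus a median
-- (percentile_cont supports interval values in PostgreSQL).
SELECT id,
       AVG(EXTRACT(EPOCH FROM session_duration))                          AS avg_seconds,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY session_duration)      AS median_duration
  FROM classified
 WHERE session_status = 'OK'
 GROUP BY id;
```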
