
I am trying to add a column called session_id. The rows should be numbered so that whenever the gap between consecutive date_time values is more than 30 minutes, a new session starts. Here is an example of what I am trying to do:

date_diff  date_time                   session_id
0          2023-01-18 00:01:40.000000  1
0          2023-01-18 00:01:42.000000  1
0          2023-01-18 00:01:46.000000  1
93         2023-01-18 01:34:38.000000  2
0          2023-01-18 01:34:38.000000  2
27         2023-01-18 02:01:59.000000  2
1          2023-01-18 02:02:00.000000  2
89         2023-01-18 03:31:40.000000  3

So whenever date_diff (in minutes) is more than 30, that row starts a new session.

3 Answers


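  1. There might be a better way to do this in Redshift, which I don’t have access to, but you might try something like this:

    -- Adding 1 makes the numbering start at 1, as in the expected output.
    -- On Redshift, an aggregate window function with ORDER BY may also need an
    -- explicit frame clause such as ROWS UNBOUNDED PRECEDING.
    SELECT date_time, date_diff,
      1 + SUM(CASE WHEN date_diff > 30 THEN 1 ELSE 0 END) OVER (ORDER BY date_time) AS session_id
    FROM your_table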
    

    This flags each row whose date_diff is greater than 30 with a 1. The OVER() clause then orders the rows by timestamp and keeps a running sum of those flags, which produces the ordered session_id you’re looking for.
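To see why a running sum of flags yields session numbers, here is a minimal Python sketch (not Redshift SQL) that mimics the two steps on the date_diff values from the question. The variable names are made up for illustration.

```python
from itertools import accumulate

# date_diff values (minutes) copied from the question's sample rows,
# already in date_time order
date_diffs = [0, 0, 0, 93, 0, 27, 1, 89]

# Step 1: flag each row whose gap exceeds 30 minutes (the CASE expression)
flags = [1 if d > 30 else 0 for d in date_diffs]

# Step 2: running total of the flags (the SUM(...) OVER (ORDER BY ...) part),
# plus 1 so that numbering starts at 1 as in the question's expected output
session_ids = [1 + total for total in accumulate(flags)]

print(flags)        # [0, 0, 0, 1, 0, 0, 0, 1]
print(session_ids)  # [1, 1, 1, 2, 2, 2, 2, 3]
```

Each flagged row bumps the running total by one, so every row after it inherits the new session number until the next flag.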

  2. One option uses a conditional window sum:

    select t.*,
        1 + sum(case when date_diff > 30 then 1 else 0 end) 
            over(order by date_time) session_id
    from mytable t
    

    If you want to compute the date difference on the fly from the timestamp column, use lag() first to fetch the previous row's timestamp:

    select t.*,
        1 + sum(case when datediff(minute, lag_date_time, date_time) > 30 then 1 else 0 end) 
            over(order by date_time) session_id
    from (
        select t.*, lag(date_time, 1, date_time) over(order by date_time) lag_date_time
        from mytable t
    ) t
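As a rough cross-check of the lag-based approach, the sketch below does the same thing in plain Python: it compares each timestamp with the previous one and starts a new session when the gap exceeds 30 minutes. It is only an illustration under the assumption that the timestamps are the ones from the question; the names `raw`, `threshold` and `session_ids` are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamps copied from the question's sample rows
raw = [
    "2023-01-18 00:01:40.000000",
    "2023-01-18 00:01:42.000000",
    "2023-01-18 00:01:46.000000",
    "2023-01-18 01:34:38.000000",
    "2023-01-18 01:34:38.000000",
    "2023-01-18 02:01:59.000000",
    "2023-01-18 02:02:00.000000",
    "2023-01-18 03:31:40.000000",
]
stamps = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in raw)

threshold = timedelta(minutes=30)
session_ids = []
current = 1
previous = None  # plays the role of lag(date_time); no previous row at the start

for ts in stamps:
    # A gap larger than the threshold opens a new session
    if previous is not None and ts - previous > threshold:
        current += 1
    session_ids.append(current)
    previous = ts

print(session_ids)  # [1, 1, 1, 2, 2, 2, 2, 3]
```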
    
  3. You can achieve this using window functions in SQL. Assuming you have a table called activity with the columns date_diff and date_time, you can use the following query to calculate the session_id:

    WITH
      time_diffs AS (
        SELECT
          *,
          LAG(date_time) OVER (ORDER BY date_time) AS prev_date_time
        FROM
          activity
      ),
      flagged_sessions AS (
        SELECT
          *,
          CASE
            WHEN EXTRACT(EPOCH FROM (date_time - prev_date_time)) / 60 > 30 THEN 1
            ELSE 0
          END AS new_session_flag
        FROM
          time_diffs
      ),
      session_ids AS (
        SELECT
          *,
          SUM(new_session_flag) OVER (ORDER BY date_time) + 1 AS session_id
        FROM
          flagged_sessions
      )
    SELECT
      date_diff,
      date_time,
      session_id
    FROM
      session_ids
    ORDER BY
      date_time;

    In this query:

    The time_diffs CTE uses the LAG window function to fetch the previous row's date_time as prev_date_time.
    The flagged_sessions CTE computes the gap between date_time and prev_date_time in minutes and sets new_session_flag to 1 if the gap is more than 30 minutes, and 0 otherwise (including the first row, where prev_date_time is NULL).
    The session_ids CTE calculates session_id as the cumulative sum of new_session_flag, plus 1.
    The final SELECT reads from the session_ids CTE and orders the result by date_time.
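If you want to try the same three steps without a database server, the sketch below runs an adapted version of the query against an in-memory SQLite database from Python. This is an assumption-laden stand-in, not the query above: SQLite has no EXTRACT(EPOCH ...), so the gap is computed with julianday() instead, and window functions require SQLite 3.25 or newer. The activity table here holds only the date_time column.

```python
import sqlite3

# Timestamps copied from the question's sample rows
rows = [
    "2023-01-18 00:01:40.000000",
    "2023-01-18 00:01:42.000000",
    "2023-01-18 00:01:46.000000",
    "2023-01-18 01:34:38.000000",
    "2023-01-18 01:34:38.000000",
    "2023-01-18 02:01:59.000000",
    "2023-01-18 02:02:00.000000",
    "2023-01-18 03:31:40.000000",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (date_time TEXT)")
conn.executemany("INSERT INTO activity VALUES (?)", [(s,) for s in rows])

query = """
WITH
  time_diffs AS (
    -- Step 1: fetch the previous row's timestamp
    SELECT date_time,
           LAG(date_time) OVER (ORDER BY date_time) AS prev_date_time
    FROM activity
  ),
  flagged_sessions AS (
    -- Step 2: flag gaps over 30 minutes (julianday() returns days)
    SELECT date_time,
           CASE
             WHEN (julianday(date_time) - julianday(prev_date_time)) * 24 * 60 > 30 THEN 1
             ELSE 0
           END AS new_session_flag
    FROM time_diffs
  )
-- Step 3: cumulative sum of the flags, plus 1
SELECT date_time,
       SUM(new_session_flag) OVER (ORDER BY date_time) + 1 AS session_id
FROM flagged_sessions
ORDER BY date_time
"""

session_ids = [session_id for _, session_id in conn.execute(query)]
print(session_ids)  # [1, 1, 1, 2, 2, 2, 2, 3]
conn.close()
```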
