skip to Main Content

I have the following table:

CREATE TABLE trajectory(
    user_id int, 
    session_id int, 
    timestamp timestamp with time zone,
    lat double precision, 
    lon double precision
);

INSERT INTO trajectory(user_id, session_id, timestamp, lat, lon) VALUES 
(1, 25304,'2008-10-23 02:53:04+01', 39.984702, 116.318417),
(1, 25304, '2008-10-23 02:53:10+01', 39.984683, 116.31845), 
(1, 25304, '2008-10-23 02:53:15+01', 39.984686, 116.318417),
(1, 25304, '2008-10-23 02:53:20+01', 39.984688, 116.318385), 
(1, 20959,'2008-10-24 02:09:59+01', 40.008304, 116.319876),
(1, 20959,'2008-10-24 02:10:04+01', 40.008413, 116.319962), 
(1, 20959,'2008-10-24 02:10:14+01', 40.007171, 116.319458),
(2, 55305, '2008-10-23 05:53:05+01', 39.984094, 116.319236), 
(2, 55305, '2008-10-23 05:53:11+01', 39.984198, 116.319322), 
(2, 55305, '2008-10-23 05:53:21+01', 39.984224, 116.319402), 
(2, 34104, '2008-10-23 23:41:04+01', 40.013867, 116.306473),
(2, 34104, '2008-10-23 23:41:16+01', 40.013907, 116.306488);

Question:

I want to add a column trip_id to this table, based on the user_id and session_id columns, so that when the user’s session id changes, I know the user is on a new trip so I add that id to the new trip column.

Required output:

user_id |session_id  |timestamp              |    lat       |   lon     | trip_id
--------|------------|-----------------------|--------------|-----------|-----------
  1     |     25304  |2008-10-23 02:53:04+01 | 39.984702    |116.318417 |       1
  1     |     25304  |2008-10-23 02:53:10+01 | 39.984683    |116.31845  |       1
  1     |     25304  |2008-10-23 02:53:15+01 | 39.984686    |116.318417 |       1
  1     |     25304  |2008-10-23 02:53:20+01 | 39.984688    |116.318385 |       1
  1     |     20959  |2008-10-24 02:09:59+01 |40.008304     |116.319876 |       2
  1     |     20959  |2008-10-24 02:10:04+01 |40.008413     |116.319962 |       2
  1     |     20959  |2008-10-24 02:10:14+01 |40.007171     |116.319458 |       2
  2     |     55305  |2008-10-23 05:53:05+01 |39.984094     |116.319236 |       1
  2     |     55305  |2008-10-23 05:53:11+01 |39.984198     |116.319322 |       1
  2     |     55305  |2008-10-23 05:53:21+01 |39.984224     |116.319402 |       1
  2     |     34104  |2008-10-23 23:41:04+01 |40.013867     |116.306473 |       2
  2     |     34104  |2008-10-23 23:41:16+01 |40.013907     |116.306488 |       2

How can I do this?

EDIT

Thanks for these great answers, but all the answers received are kind of retrieving table values, they do not modify the table. Plus, I added the timestamp column that I initially omitted thinking that would simplify my question.

3

Answers


  1. We can approach this with a gaps-and-island technique. Assuming that you have a column to order your dataset, say ordering_id:

    select t.*, 
        count(*) 
            filter(where session_id is distinct from lag_session_id) 
            over(partition by user_id order by ordering_id) trip_id
    from (
        select t.*, 
            lag(session_id) over(partition by user_id order by ordering_id) lag_session_id
        from trajectory t
    ) t
    

    The idea is to bring the "preceding" session id of the same user with lag, then to compare it with the value on the current row; we can then count how many times it changed, which defines the trip id.

    user_id session_id lat lon ordering_id lag_session_id trip_id
    1 25304 39.984702 116.318417 1 null 1
    1 25304 39.984683 116.31845 2 25304 1
    1 25304 39.984686 116.318417 3 25304 1
    1 25304 39.984688 116.318385 4 25304 1
    1 20959 40.008304 116.319876 5 25304 2
    1 20959 40.008413 116.319962 6 20959 2
    1 20959 40.007171 116.319458 7 20959 2
    2 55305 39.984094 116.319236 8 null 1
    2 55305 39.984198 116.319322 9 55305 1
    2 55305 39.984224 116.319402 10 55305 1
    2 34104 40.013867 116.306473 11 55305 2
    2 34104 40.013907 116.306488 12 34104 2

    fiddle

    Login or Signup to reply.
  2. Yet another approach would involve:

    • selecting the minimum record of your identifier
    • applying a DENSE RANK by ordering on it.

    If you don’t have any ordering field, you can use "ctid" for that purpose: it’s a field that stores the physical location of each row within the table. Otherwise use the timestamp field in place of "ctid".

    ALTER TABLE trajectory ADD COLUMN trip_id INT;
    
    WITH cte AS (
        SELECT ctid, 
               user_id, 
               MIN(ctid) OVER(PARTITION BY user_id, session_id) AS parts 
        FROM trajectory
    ), cte2 AS (
        SELECT ctid,
               DENSE_RANK() OVER(PARTITION BY user_id ORDER BY parts) AS trip_id 
        FROM cte
    )
    UPDATE trajectory 
    SET trip_id = cte2.trip_id
    FROM cte2
    WHERE trajectory.ctid = cte2.ctid;
    

    Output:

    user_id session_id lat lon trip_id
    1 25304 39.984702 116.318417 1
    1 25304 39.984683 116.31845 1
    1 25304 39.984686 116.318417 1
    1 25304 39.984688 116.318385 1
    1 20959 40.008304 116.319876 2
    1 20959 40.008413 116.319962 2
    1 20959 40.007171 116.319458 2
    2 55305 39.984094 116.319236 1
    2 55305 39.984198 116.319322 1
    2 55305 39.984224 116.319402 1
    2 34104 40.013867 116.306473 2
    2 34104 40.013907 116.306488 2

    Check the demo here.

    Login or Signup to reply.
  3. This can be accomplished by using the window function lag() to retrieve the previous row and determine whether or not it has changed, followed by the window function sum() to retrieve the trip_id:

    with cte as (
      select *, case when 
                     session_id - lag(session_id, 1, session_id) 
                                  over (partition by user_id order by timestamp) = 0 
                     then 0 else 1 end as diff
      from trajectory
    )
    select user_id, session_id, timestamp,  lat, lon, 1 + sum(diff) over (partition by user_id order by timestamp) as trip_id
    from cte
    order by user_id, timestamp
    

    Assuming the user_id/timestamp is unique, the update can be :

    ALTER TABLE trajectory ADD COLUMN trip_id int;
    
    with cte as (
      select *, case when session_id - lag(session_id, 1, session_id) over (partition by user_id order by timestamp) = 0 then 0 else 1 end as diff
      from trajectory
    ),
    cte2 as (
      select user_id,   session_id, timestamp,  lat, lon, 1 + sum(diff) over (partition by user_id order by timestamp) as trip_id
      from cte
      order by user_id, timestamp
    )
    UPDATE trajectory 
    SET trip_id = cte2.trip_id
    FROM cte2
    WHERE trajectory.timestamp = cte2.timestamp and trajectory.user_id = cte2.user_id
    

    Result :

    user_id session_id timestamp lat lon trip_id
    1 25304 2008-10-23 02:53:04+01 39.984702 116.318417 1
    1 25304 2008-10-23 02:53:10+01 39.984683 116.31845 1
    1 25304 2008-10-23 02:53:15+01 39.984686 116.318417 1
    1 25304 2008-10-23 02:53:20+01 39.984688 116.318385 1
    1 20959 2008-10-24 02:09:59+01 40.008304 116.319876 2
    1 20959 2008-10-24 02:10:04+01 40.008413 116.319962 2
    1 20959 2008-10-24 02:10:14+01 40.007171 116.319458 2
    2 55305 2008-10-23 05:53:05+01 39.984094 116.319236 1
    2 55305 2008-10-23 05:53:11+01 39.984198 116.319322 1
    2 55305 2008-10-23 05:53:21+01 39.984224 116.319402 1
    2 34104 2008-10-23 23:41:04+01 40.013867 116.306473 2
    2 34104 2008-10-23 23:41:16+01 40.013907 116.306488 2

    Demo here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search