I have a table where I record events. These events come into the system once a minute, per user – on the order of tens of thousands a minute. But I don’t need to save all of them.
If an event comes in within 90 seconds of the previous event, I want to update the previous row. If it’s been more than 90 seconds, I want to insert a new row.

Example table:

create table events (
  id serial,
  user_id int references users(id) not null,
  created_at timestamp not null default now(),
  updated_at timestamp not null default now(),
  ...some other event columns here...
);

create index idx_events_user_id_updated_at on events (user_id, updated_at desc);

And the pseudocode would be something like:

  1. Fetch last event for user
  2. If updated_at is within the last 90 seconds:
    1. Update row, including updated_at with now()
  3. Otherwise:
    1. Insert new row

Is there a way to do this with a single Postgres statement?

I’m aware of ON CONFLICT, but I don’t think it will work for this use case. The (user_id, updated_at) pair could be defined as a unique constraint, which could be used to trigger an ON CONFLICT, but the timestamps are arbitrary. These events come in "every minute", but not exactly on a minute, or even exactly a minute apart, due to network latency, server latency, etc.; hence using 90 seconds to give a 30-second buffer. Truncating the timestamps to the minute would reduce the feature’s usefulness, so I’d hate to do that just to handle upserts more cleanly.

2 Answers


  1. I think your best option is to create a stored procedure with an upsert block, i.e.:

    UPDATE ...
    IF NOT FOUND THEN 
      INSERT ...
    END IF
    

    Start by assuming that the record exists and do an UPDATE. Include a WHERE clause that checks if the record was updated less than 90 seconds ago.

    If nothing was updated, it means that either there was no previous record, or the previous record was updated more than 90 seconds ago; either way you need to do an INSERT.

    From the client’s perspective there will only ever be a single call (executing the procedure). Most of the time a single UPDATE on the server will suffice; occasionally the second INSERT will be needed as well.
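
    The UPDATE-then-INSERT block above could be sketched as a PL/pgSQL function like the following. This is only a sketch: it assumes the events table from the question plus a single data column standing in for "some other event columns", and the function and parameter names are invented for illustration.

    ```sql
    -- Hypothetical sketch of the update-then-insert procedure.
    -- Assumes the events table from the question, plus a "data" column
    -- standing in for the other event columns; names are made up.
    CREATE OR REPLACE FUNCTION record_event(p_user_id int, p_data text)
      RETURNS void
      LANGUAGE plpgsql AS
    $$
    BEGIN
       -- Assume a recent row exists and try to update it.
       UPDATE events
       SET    updated_at = now()
            , data = p_data
       WHERE  user_id = p_user_id
       AND    updated_at > now() - interval '90 seconds';

       -- FOUND is false if the UPDATE touched no row:
       -- either no previous row, or it is older than 90 seconds.
       IF NOT FOUND THEN
          INSERT INTO events (user_id, data)
          VALUES (p_user_id, p_data);
       END IF;
    END
    $$;
    ```

    The client then issues a single call per event, e.g. `SELECT record_event(1, 'foo');`.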

  2. Is there a way to do this with a single Postgres statement?

    It can be done with an UPSERT (INSERT ... ON CONFLICT). You need an exclusion constraint on the timestamp range. To cover the leading integer column user_id, the additional module btree_gist must be installed.

    ALTER TABLE events
      ADD CONSTRAINT user_90sec
      EXCLUDE USING gist (user_id WITH =, tsrange(updated_at, updated_at + interval '90 sec') WITH &&);
    

    Notably, the same doesn’t work for timestamptz, because timestamptz + interval is only STABLE, while timestamp + interval is IMMUTABLE as required for the implied GiST index (or any index for that matter).

    Query:

    WITH input_rows(user_id, data) AS (
       VALUES
          (1, 'foo_new')  -- your input here
        , (2, 'bar_new')
        , (3, 'baz_new')
        -- more?
       )
    , ins AS (
       INSERT INTO events (user_id, data) 
       SELECT user_id, data FROM input_rows
       ON CONFLICT ON CONSTRAINT user_90sec DO NOTHING
       RETURNING user_id
       )
    UPDATE events e
    SET    updated_at = LOCALTIMESTAMP
         , data = i.data
    FROM   input_rows i
    LEFT   JOIN ins USING (user_id)
    WHERE  ins.user_id IS NULL
    AND    e.user_id = i.user_id
    AND    e.updated_at > LOCALTIMESTAMP - interval '90 sec';
    


    This assumes at most one row per user_id in the input.

    And no concurrent, competing writes. Else, there may be race conditions due to the unavoidable time gap between INSERT & UPDATE. (ON CONFLICT ... DO UPDATE is not allowed based on an exclusion constraint.)

    Also, no ambiguous data types in the input, or you must add an explicit cast.

    Depending on details of your case, there may well be faster solutions. Like, just write all incoming rows, and later consolidate in bulk operations.

    While sticking with this solution, it may be faster to add a generated column for the tsrange and base the exclusion constraint (and its implied GiST index) on that. Bloats the table row, though.
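
    That variant could look roughly like this. A sketch only: it assumes Postgres 12 or later (for stored generated columns), and the column name upd_range is invented. The generated expression must be IMMUTABLE, which holds for timestamp (not timestamptz), as noted above.

    ```sql
    -- Hypothetical variant: precompute the range in a stored generated
    -- column (Postgres 12+) and base the exclusion constraint on it.
    -- First drop the original constraint from above, then:
    ALTER TABLE events
      DROP CONSTRAINT IF EXISTS user_90sec
    , ADD COLUMN upd_range tsrange
        GENERATED ALWAYS AS
          (tsrange(updated_at, updated_at + interval '90 sec')) STORED
    , ADD CONSTRAINT user_90sec
        EXCLUDE USING gist (user_id WITH =, upd_range WITH &&);
    ```

    The ON CONFLICT clause in the query above keeps working unchanged, since it targets the constraint by name.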
