
I have the following postgres query that’s meant to update all message entries:

update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid

The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.

How do I speed up this simple update command to make it as fast as possible?

The type of the column in the first clause (user_id) is UUID, and in the second (user_pgid) it is i64 (bigint in Postgres).

2 Answers


  1. Parallel query plans only for SELECT

    Postgres only uses parallel plans for SELECT (and some commands based on a SELECT, like CREATE TABLE AS). Queries modifying any rows cannot be parallelized. The manual:

    Even when it is in general possible for parallel query plans to be
    generated, the planner will not generate them for a given query if any
    of the following are true:

    • The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within
      a CTE, no parallel plans for that query will be generated.

    Manual parallelization

    To address what you asked: split up the table into N non-overlapping slices along user_pgid, roughly equal in size – where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available. In total, stay below the I/O capacity of your DB server.

    Create a PROCEDURE like:

    CREATE OR REPLACE PROCEDURE public.f_upd_message(
         _lower_incl bigint
       , _upper_excl bigint
       , _step int = 50000
       )
      LANGUAGE plpgsql AS
    $proc$
    DECLARE
       _low bigint;
       _upd_ct int;
    BEGIN
       IF _upper_excl <= _lower_incl OR _step < 1 THEN
          RAISE EXCEPTION '_upper_excl must be > _lower_incl & _step > 0! Was: _lower_incl: %, _upper_excl: %, _step: %'
                         , _lower_incl, _upper_excl, _step;
       END IF;
       
       FOR _low IN _lower_incl .. _upper_excl - 1 BY _step
       LOOP
          RAISE NOTICE 'user_pgid >= % AND user_pgid < %'
                      , _low, LEAST(_upper_excl, _low + _step);  -- optional
    
          UPDATE public.message m
          SET    user_id = u.id
          FROM   public."user" u
          WHERE  m.user_pgid >= _low
          AND    m.user_pgid <  _low + _step
          AND    m.user_pgid <  _upper_excl  -- enforce upper bound
          AND    u.user_pgid = m.user_pgid
          AND    m.user_id <> u.id;          -- ① suppress empty updates
          
          GET DIAGNOSTICS _upd_ct = ROW_COUNT;  -- optional
    
          COMMIT;
    
          RAISE NOTICE 'Updated % rows', _upd_ct;  -- optional
       END LOOP;
    END
    $proc$;
    

    Call:

    CALL public.f_upd_message(20000000, 30000000);
    

    Or:

    CALL public.f_upd_message(20000000, 30000000, 100000);
    

    ① Avoid empty updates (where the column value wouldn’t change). If user_id can be null, use a null-safe comparison with IS DISTINCT FROM. This also prevents repeated updates in case you have to start over or mess up slices.
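
    For instance, the null-safe variant of the filter marked ① would read:

    AND    m.user_id IS DISTINCT FROM u.id   -- instead of: m.user_id <> u.id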

    Base your slices on actual min and max user_pgid:

    SELECT min(user_pgid) AS _lower_incl, max(user_pgid) + 1 AS _upper_excl
    FROM   public.message;
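
    For example, slice boundaries for N roughly equal ranges can be derived from that spread (a sketch assuming N = 4; integer division keeps the slices contiguous and non-overlapping):

    WITH bounds AS (
       SELECT min(user_pgid) AS lo, max(user_pgid) + 1 AS hi
       FROM   public.message
       )
    SELECT lo + ((hi - lo) *  g     ) / 4 AS _lower_incl
         , lo + ((hi - lo) * (g + 1)) / 4 AS _upper_excl
    FROM   bounds, generate_series(0, 3) AS g  -- one row per slice
    ORDER  BY g;

    Each resulting row supplies the arguments for one CALL, i.e. one session.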
    

    Adjust the _step size to your system. I added a default of 50000.

    Then run N separate sessions each processing one slice. Like, start N psql instances (manually or in a shell script).
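
    For instance, with N = 4 and boundaries taken from the query above (the numbers here are just placeholders), each session would run one of:

    CALL public.f_upd_message(        0,  75000000);
    CALL public.f_upd_message( 75000000, 150000000);
    CALL public.f_upd_message(150000000, 225000000);
    CALL public.f_upd_message(225000000, 300000000);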

    Each step is committed. If you still run into a timeout (or any other problem), work that has already been committed is not rolled back.


    Side effects, notes

    The table will grow up to twice its size because every update leaves a dead tuple behind. You may want to run VACUUM FULL afterwards – if you can afford to do so. Alternatively, issue VACUUM at reasonable intervals to make space from dead tuples available for reuse.
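
    For instance, a plain VACUUM between slices, run from the calling session (VACUUM cannot run inside a transaction block):

    VACUUM public.message;  -- or VACUUM (ANALYZE) public.message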

    Various other optimizations are possible. Like drop and later recreate FK constraints, indexes, … But you absolutely need an index on message(user_pgid) for this!
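
    If that index does not exist yet, it might look like this (the index name is just an example):

    CREATE INDEX IF NOT EXISTS message_user_pgid_idx ON public.message (user_pgid);
    -- or CREATE INDEX CONCURRENTLY ... to avoid blocking concurrent writes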

    If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows. Like Frank already suggested. That gives you a pristine (clustered) table without bloat.
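
    A minimal sketch of that approach (the column list and the new table name are assumed; unlike UPDATE, the underlying SELECT of a CREATE TABLE AS can use a parallel plan):

    CREATE TABLE message_new AS
    SELECT m.message_pgid, m.user_pgid, u.id AS user_id, m.body  -- assumed column list
    FROM   public.message m
    JOIN   public."user" u USING (user_pgid)
    ORDER  BY m.user_pgid;  -- optional: physically cluster the copy

    Then recreate indexes and constraints on message_new and swap it in for the old table with ALTER TABLE ... RENAME.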

  2. To speed up the update operation on a large table, you can try a few techniques to optimize the query:

    Batch Processing
    Rather than updating all records in a single transaction, you can update them in smaller batches. This reduces the impact on system resources and helps avoid timeouts.

    A sample implementation of batch processing:

    DO $$
    DECLARE
       batch_size   integer := 30000; -- Adjust the batch size as needed
       rows_updated integer;
    BEGIN
       LOOP
          -- UPDATE has no LIMIT/OFFSET in Postgres, so pick each batch of
          -- not-yet-updated rows via their ctid in a subquery instead
          UPDATE message m
          SET    user_id = u.id
          FROM   "user" u
          WHERE  u.user_pgid = m.user_pgid
          AND    m.ctid IN (
                   SELECT m2.ctid
                   FROM   message m2
                   JOIN   "user" u2 ON u2.user_pgid = m2.user_pgid
                   WHERE  m2.user_id IS DISTINCT FROM u2.id
                   LIMIT  batch_size
                 );
          GET DIAGNOSTICS rows_updated = ROW_COUNT;
          EXIT WHEN rows_updated = 0;
          COMMIT;  -- allowed in a DO block on Postgres 11+ when not inside an outer transaction
       END LOOP;
    END $$;
    