I have the following postgres query that’s meant to update all message entries:
update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid
The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.
How do I speed up this simple update command to make it as fast as possible?
The type in the first clause (`user_id`) is a UUID, and in the second (`user_pgid`) it is a 64-bit integer (i64).
2 Answers
Parallel query plans only for SELECT

Postgres only uses parallel plans for `SELECT` (and some commands based on a `SELECT`, like `CREATE TABLE AS`). Queries modifying any rows cannot be parallelized; the manual spells out when parallel query plans can be used.

Manual parallelization
To address what you asked: split up the table into N non-overlapping slices along `user_pgid`, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total, stay below the I/O capacity of your DB server.

Create a `PROCEDURE` that walks one slice in small steps and commits after each step.
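A minimal sketch of such a procedure, assuming half-open slices `[_from, _to)` over `user_pgid`; the procedure name, parameter names, and step handling are illustrative:

```sql
CREATE OR REPLACE PROCEDURE upd_message_slice(_from bigint, _to bigint, _step bigint DEFAULT 50000)
LANGUAGE plpgsql AS
$proc$
DECLARE
   _low  bigint := _from;
   _high bigint;
BEGIN
   LOOP
      _high := LEAST(_low + _step, _to);       -- upper bound of this step

      UPDATE message m
      SET    user_id = u.id
      FROM   "user" u
      WHERE  m.user_pgid >= _low
      AND    m.user_pgid <  _high
      AND    u.user_pgid =  m.user_pgid
      AND    m.user_id IS DISTINCT FROM u.id;  -- ① skip rows that would not change

      COMMIT;                                  -- make each step durable right away

      EXIT WHEN _high >= _to;
      _low := _high;
   END LOOP;
END
$proc$;
```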
Call it once per slice, one call per session, either with the default `_step` or an explicit one.
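For example (the slice bounds here are made up; derive real ones from the min/max query below):

```sql
-- Session 1:
CALL upd_message_slice(0, 100000000);
-- Session 2, with an explicit step size:
CALL upd_message_slice(100000000, 200000000, 100000);
-- ... one CALL per slice, N sessions running in parallel
```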
① Avoid empty updates (where the column value wouldn't change). If `user_id` can be NULL, use the null-safe comparison `IS DISTINCT FROM`. This also prevents repeated updates in case you have to start over or mess up your slices.
Base your slices on the actual min and max `user_pgid`:
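One way to get the bounds, then cut the `[min, max]` range into N roughly equal slices:

```sql
SELECT min(user_pgid) AS min_pgid
     , max(user_pgid) AS max_pgid
FROM   message;
```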
Adjust the `_step` size to your system. I added a default of 50000. Then run N separate sessions, each processing one slice. For example, start N psql instances (manually or in a shell script).
Each step is committed separately. If you should still run into a timeout (or any other problem), committed work is not rolled back.
Side effects, notes
The table will grow up to twice its size, because every update leaves a dead tuple behind. You may want to run `VACUUM FULL` afterwards, if you can afford to do so. Alternatively, issue `VACUUM` at reasonable intervals between slices to make the space from dead tuples available for reuse.
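For instance (note that `VACUUM FULL` rewrites the table and takes an exclusive lock, so schedule it accordingly):

```sql
VACUUM message;        -- makes space from dead tuples reusable within the table
VACUUM FULL message;   -- rewrites the table to shrink it on disk (exclusive lock!)
```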
Various other optimizations are possible, like dropping and later recreating FK constraints, indexes, and so on. But you absolutely need an index on `message(user_pgid)` for this!

If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows, like Frank already suggested. That gives you a pristine (clustered) table without bloat.
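A rough sketch of that copy-based approach; the new table name and the extra columns are placeholders, and indexes, constraints, triggers, and privileges would need to be recreated on the new table separately:

```sql
BEGIN;

-- Build a fresh, already-updated (and physically sorted) copy of the table:
CREATE TABLE message_new AS
SELECT m.id, m.user_pgid, u.id AS user_id, m.body   -- spell out the full column list
FROM   message m
LEFT   JOIN "user" u ON u.user_pgid = m.user_pgid
ORDER  BY m.user_pgid;                               -- optional: clusters the copy

-- Swap the tables:
ALTER TABLE message     RENAME TO message_old;
ALTER TABLE message_new RENAME TO message;

COMMIT;

-- DROP TABLE message_old;   -- after verifying the new table
```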
To speed up the update operation on a large table, you can try a few techniques to optimize the query:
Batch Processing
Rather than updating all records in a single transaction, you can update them in smaller batches. This reduces the impact on system resources and helps avoid timeouts.
A sample implementation of batch processing:
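A sketch of one way to do it, assuming `user_id` is NULL before the update; the batch size and the ctid-based row selection are illustrative choices. It needs Postgres 11+ and must run at the top level (outside an explicit transaction) so that `COMMIT` is allowed inside the `DO` block:

```sql
DO
$$
DECLARE
   _batch_size int := 50000;   -- tune to your system
   _rows       bigint;
BEGIN
   LOOP
      -- Pick a batch of not-yet-updated rows that have a matching user,
      -- then update exactly those rows.
      UPDATE message m
      SET    user_id = u.id
      FROM   "user" u
      WHERE  u.user_pgid = m.user_pgid
      AND    m.ctid = ANY (ARRAY(
                SELECT m2.ctid
                FROM   message m2
                JOIN   "user" u2 ON u2.user_pgid = m2.user_pgid
                WHERE  m2.user_id IS NULL
                LIMIT  _batch_size));

      GET DIAGNOSTICS _rows = ROW_COUNT;
      COMMIT;                   -- keep transactions (and locks) short
      EXIT WHEN _rows = 0;      -- stop once no matching rows remain
   END LOOP;
END
$$;
```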