I have the following postgres query that’s meant to update all message entries:
update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid
The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.
How do I speed up this simple update command to make it as fast as possible?
The type in the first clause (`user_id`) is a UUID, and in the second (`user_pgid`) it is a 64-bit integer (i64).
2 Answers
Parallel query plans only for SELECT

Postgres only uses parallel plans for `SELECT` (and some commands based on a `SELECT`, like `CREATE TABLE AS`). Queries modifying any rows cannot be parallelized; the manual spells out when parallel query plans can be used.

Manual parallelization
To address what you asked: split up the table into N non-overlapping slices along `user_pgid`, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total, stay below the I/O capacity of your DB server.

Create a `PROCEDURE` that walks one slice in small steps and commits after each step.
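A minimal sketch of such a procedure, assuming half-open slices `[_from, _to)` over `user_pgid`; the procedure name, parameter names, and step handling are illustrative:

```sql
CREATE OR REPLACE PROCEDURE upd_message_slice(_from bigint, _to bigint, _step bigint DEFAULT 50000)
LANGUAGE plpgsql AS
$proc$
DECLARE
   _low  bigint := _from;
   _high bigint;
BEGIN
   LOOP
      _high := LEAST(_low + _step, _to);       -- upper bound of this step

      UPDATE message m
      SET    user_id = u.id
      FROM   "user" u
      WHERE  m.user_pgid >= _low
      AND    m.user_pgid <  _high
      AND    u.user_pgid =  m.user_pgid
      AND    m.user_id IS DISTINCT FROM u.id;  -- ① skip rows that would not change

      COMMIT;                                  -- make each step durable right away

      EXIT WHEN _high >= _to;
      _low := _high;
   END LOOP;
END
$proc$;
```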
Call it once per slice, one call per session, either with the default `_step` or an explicit one.
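For example (the slice bounds here are made up; derive real ones from the min/max query below):

```sql
-- Session 1:
CALL upd_message_slice(0, 100000000);
-- Session 2, with an explicit step size:
CALL upd_message_slice(100000000, 200000000, 100000);
-- ... one CALL per slice, N sessions running in parallel
```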
① Avoid empty updates (where the column value wouldn't change). If `user_id` can be NULL, use the null-safe comparison `IS DISTINCT FROM`. This also prevents repeated updates in case you have to start over or mess up your slices.
Base your slices on the actual min and max `user_pgid`:
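One way to get the bounds, then cut the `[min, max]` range into N roughly equal slices:

```sql
SELECT min(user_pgid) AS min_pgid
     , max(user_pgid) AS max_pgid
FROM   message;
```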
Adjust the `_step` size to your system. I added a default of 50000. Then run N separate sessions, each processing one slice. For example, start N psql instances (manually or in a shell script).
Each step is committed separately. If you should still run into a timeout (or any other problem), committed work is not rolled back.
Side effects, notes
The table will grow up to twice its size, because every update leaves a dead tuple behind. You may want to run `VACUUM FULL` afterwards, if you can afford to do so. Alternatively, issue `VACUUM` at reasonable intervals between slices to make the space from dead tuples available for reuse.
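For instance (note that `VACUUM FULL` rewrites the table and takes an exclusive lock, so schedule it accordingly):

```sql
VACUUM message;        -- makes space from dead tuples reusable within the table
VACUUM FULL message;   -- rewrites the table to shrink it on disk (exclusive lock!)
```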
Various other optimizations are possible, like dropping and later recreating FK constraints, indexes, and so on. But you absolutely need an index on `message(user_pgid)` for this!

If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows, like Frank already suggested. That gives you a pristine (clustered) table without bloat.
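A rough sketch of that copy-based approach; the new table name and the extra columns are placeholders, and indexes, constraints, triggers, and privileges would need to be recreated on the new table separately:

```sql
BEGIN;

-- Build a fresh, already-updated (and physically sorted) copy of the table:
CREATE TABLE message_new AS
SELECT m.id, m.user_pgid, u.id AS user_id, m.body   -- spell out the full column list
FROM   message m
LEFT   JOIN "user" u ON u.user_pgid = m.user_pgid
ORDER  BY m.user_pgid;                               -- optional: clusters the copy

-- Swap the tables:
ALTER TABLE message     RENAME TO message_old;
ALTER TABLE message_new RENAME TO message;

COMMIT;

-- DROP TABLE message_old;   -- after verifying the new table
```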
To speed up the update operation on a large table, you can try a few techniques to optimize the query:
Batch Processing
Rather than updating all records in a single transaction, you can update them in smaller batches. This reduces the impact on system resources and helps avoid timeouts.
A sample implementation of batch processing:
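A sketch of one way to do it, assuming `user_id` is NULL before the update; the batch size and the ctid-based row selection are illustrative choices. It needs Postgres 11+ and must run at the top level (outside an explicit transaction) so that `COMMIT` is allowed inside the `DO` block:

```sql
DO
$$
DECLARE
   _batch_size int := 50000;   -- tune to your system
   _rows       bigint;
BEGIN
   LOOP
      -- Pick a batch of not-yet-updated rows that have a matching user,
      -- then update exactly those rows.
      UPDATE message m
      SET    user_id = u.id
      FROM   "user" u
      WHERE  u.user_pgid = m.user_pgid
      AND    m.ctid = ANY (ARRAY(
                SELECT m2.ctid
                FROM   message m2
                JOIN   "user" u2 ON u2.user_pgid = m2.user_pgid
                WHERE  m2.user_id IS NULL
                LIMIT  _batch_size));

      GET DIAGNOSTICS _rows = ROW_COUNT;
      COMMIT;                   -- keep transactions (and locks) short
      EXIT WHEN _rows = 0;      -- stop once no matching rows remain
   END LOOP;
END
$$;
```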