I have a large CSV file that I need to load into a PostgreSQL database. I am considering two different approaches and would like to know which one is faster for very large datasets. Here are the two approaches:
-- Approach 1: CREATE TABLE AS SELECT
BEGIN;
CREATE TEMP TABLE temp_data (
    id int,
    column1 int,
    column2 text
    -- other columns as needed
);
COPY temp_data FROM '/path/to/csv' WITH (FORMAT csv, HEADER true);
CREATE TABLE target_table AS
SELECT column1 * 2 AS column1, substring(column2, 1, 2) AS column2
FROM temp_data;
DROP TABLE temp_data;
COMMIT;
-- Approach 2: INSERT ... SELECT into an existing table
BEGIN;
CREATE TEMP TABLE temp_data (
    id int,
    column1 int,
    column2 text
    -- other columns as needed
);
COPY temp_data FROM '/path/to/csv' WITH (FORMAT csv, HEADER true);
INSERT INTO target_table (column1, column2)
SELECT column1 * 2, substring(column2, 1, 2)
FROM temp_data;
DROP TABLE temp_data;
COMMIT;
Which approach is generally faster for loading large datasets in PostgreSQL, and why? Are there any specific considerations or optimizations I should be aware of for either approach?
2 Answers
It does not matter much here – both will perform almost exactly the same.
I’d recommend the second approach though, as it leaves you better options for optimization later:
You can precisely control column types, storage parameters, compression, collation, etc.
You can create the table with the check constraints you need before loading your data. Bulk inserts like this are usually I/O-bound, so the CPU can validate those constraints while loading, and you don’t have to add them afterwards, which would take additional time.
You can partition your table before loading your data (see the sketch after this list).
You can load your data in parallel, which is even faster if the table is partitioned.
You can check progress by monitoring the table size while the INSERT is running; you can’t do this while CREATE TABLE AS is running.
You can load your data with INSERT in chunks, so you can interrupt execution and then resume it without losing much progress.
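For example, a minimal sketch of the second approach with the target table defined up front; the column types, the check constraint, and the partitioning scheme are illustrative assumptions rather than details from the question:

-- Create the target table first: exact types, a check constraint,
-- and range partitioning are all in place before any data arrives.
CREATE TABLE target_table (
    id      int,
    column1 int CHECK (column1 >= 0),  -- hypothetical constraint
    column2 text
) PARTITION BY RANGE (id);

CREATE TABLE target_table_p1 PARTITION OF target_table FOR VALUES FROM (1) TO (1000001);
CREATE TABLE target_table_p2 PARTITION OF target_table FOR VALUES FROM (1000001) TO (2000001);
-- add more partitions (or a DEFAULT partition) as needed

-- Load in chunks, so an interrupted run can be resumed from the last chunk:
INSERT INTO target_table (id, column1, column2)
SELECT id, column1 * 2, substring(column2, 1, 2)
FROM temp_data
WHERE id BETWEEN 1 AND 1000000;

INSERT INTO target_table (id, column1, column2)
SELECT id, column1 * 2, substring(column2, 1, 2)
FROM temp_data
WHERE id BETWEEN 1000001 AND 2000000;

-- From another session, you can monitor progress, e.g.:
-- SELECT pg_size_pretty(pg_relation_size('target_table_p1'));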
Actually, there can be a substantial difference, and the faster option is approach 1. The reason is that it can largely skip writing to the WAL (write-ahead log) while operating with wal_level set to minimal. The manual notes that commands like CREATE TABLE AS SELECT and COPY FROM into a table created or truncated earlier in the same transaction are designed not to write WAL at all in that case. That means only about half the write cost for your approach 1, mostly because you create the table in the same transaction; splitting off the INSERT barely makes any difference. I would consolidate to:
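A consolidated version might look like the following sketch; it assumes wal_level = minimal (which also requires max_wal_senders = 0 and a server restart) and the same illustrative column types as in the question:

-- Requires in postgresql.conf (and a restart):
--   wal_level = minimal
--   max_wal_senders = 0

BEGIN;

SET temp_buffers = '1GB';  -- hypothetical size; must be set before the
                           -- first use of temporary tables in the session

CREATE TEMP TABLE temp_data (
    id int,
    column1 int,
    column2 text
) ON COMMIT DROP;          -- no separate DROP needed

COPY temp_data FROM '/path/to/csv' WITH (FORMAT csv, HEADER true);

-- Created in the same transaction, so it can skip WAL with wal_level = minimal:
CREATE TABLE target_table AS
SELECT column1 * 2 AS column1, substring(column2, 1, 2) AS column2
FROM temp_data;

COMMIT;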
About temp_buffers: set it high enough, before the first use of temporary tables in the session, and the temp table can live entirely in RAM. If you can connect to the input file with a foreign data wrapper directly (like Zegarek commented), you can skip the temp table. But that’s often not easily possible, and it doesn’t buy nearly as much as skipping WAL, as long as the temp table does live in RAM exclusively; writes to persistent storage are typically much more expensive.
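A sketch of that route using the file_fdw extension that ships with PostgreSQL; the server and foreign table names are made up for illustration, and reading server-side files this way requires appropriate privileges:

CREATE EXTENSION IF NOT EXISTS file_fdw;

CREATE SERVER csv_source FOREIGN DATA WRAPPER file_fdw;

-- Maps the CSV file directly; no COPY into a temp table needed.
CREATE FOREIGN TABLE csv_data (
    id int,
    column1 int,
    column2 text
) SERVER csv_source
OPTIONS (filename '/path/to/csv', format 'csv', header 'true');

BEGIN;
CREATE TABLE target_table AS
SELECT column1 * 2 AS column1, substring(column2, 1, 2) AS column2
FROM csv_data;
COMMIT;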