I have a large CSV file that I need to load into a PostgreSQL database. I am considering two different approaches and would like to know which one is faster for very large datasets. Here are the two approaches:
-- Approach 1: CREATE TABLE AS SELECT
BEGIN;
CREATE TEMP TABLE temp_data (
    id int,
    column1 int,
    column2 text
    -- other columns as needed
);
COPY temp_data FROM '/path/to/csv' WITH (FORMAT csv, HEADER true);
CREATE TABLE target_table AS
SELECT column1 * 2 AS column1, substring(column2, 1, 2) AS column2
FROM temp_data;
DROP TABLE temp_data;
COMMIT;
-- Approach 2: INSERT ... SELECT into an existing table
BEGIN;
CREATE TEMP TABLE temp_data (
    id int,
    column1 int,
    column2 text
    -- other columns as needed
);
COPY temp_data FROM '/path/to/csv' WITH (FORMAT csv, HEADER true);
INSERT INTO target_table (column1, column2)
SELECT column1 * 2, substring(column2, 1, 2)
FROM temp_data;
DROP TABLE temp_data;
COMMIT;
Which approach is generally faster for loading large datasets in PostgreSQL, and why? Are there any specific considerations or optimizations I should be aware of for either approach?
2 Answers
It does not matter much here – both will perform almost exactly the same.
I’d recommend the second approach though, as it leaves you better options for optimization later:
You can precisely control column types, storage parameters, compression, collation, etc.
You can create the table with the check constraints you need before loading your data. Bulk inserts like this are usually I/O-bound, so the CPU can validate those constraints while loading, and you don’t have to add them afterwards, which would take additional time.
You can partition your table before loading your data (see the sketch after this list).
You can load your data in parallel, which is even faster if the table is partitioned.
You can check progress by monitoring the table size while the INSERT is running; you can’t do this while CREATE TABLE AS is running.
You can load your data with INSERT in chunks, so you can interrupt execution and then resume it without losing much progress.
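For example, a minimal sketch of the second approach with the target table defined up front; the column types, the check constraint, and the partitioning scheme are illustrative assumptions rather than details from the question:

-- Create the target table first: exact types, a check constraint,
-- and range partitioning are all in place before any data arrives.
CREATE TABLE target_table (
    id      int,
    column1 int CHECK (column1 >= 0),  -- hypothetical constraint
    column2 text
) PARTITION BY RANGE (id);

CREATE TABLE target_table_p1 PARTITION OF target_table FOR VALUES FROM (1) TO (1000001);
CREATE TABLE target_table_p2 PARTITION OF target_table FOR VALUES FROM (1000001) TO (2000001);
-- add more partitions (or a DEFAULT partition) as needed

-- Load in chunks, so an interrupted run can be resumed from the last chunk:
INSERT INTO target_table (id, column1, column2)
SELECT id, column1 * 2, substring(column2, 1, 2)
FROM temp_data
WHERE id BETWEEN 1 AND 1000000;

INSERT INTO target_table (id, column1, column2)
SELECT id, column1 * 2, substring(column2, 1, 2)
FROM temp_data
WHERE id BETWEEN 1000001 AND 2000000;

-- From another session, you can monitor progress, e.g.:
-- SELECT pg_size_pretty(pg_relation_size('target_table_p1'));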
Actually, there can be a substantial difference, and the faster option is approach 1. The reason is that it can largely skip writing to the WAL (write-ahead log) while operating with wal_level set to minimal. The manual notes that commands like CREATE TABLE AS SELECT and COPY FROM into a table created or truncated earlier in the same transaction are designed not to write WAL at all in that case. That means only about half the write cost for your approach 1, mostly because you create the table in the same transaction; splitting off the INSERT barely makes any difference. I would consolidate to:
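A consolidated version might look like the following sketch; it assumes wal_level = minimal (which also requires max_wal_senders = 0 and a server restart) and the same illustrative column types as in the question:

-- Requires in postgresql.conf (and a restart):
--   wal_level = minimal
--   max_wal_senders = 0

BEGIN;

SET temp_buffers = '1GB';  -- hypothetical size; must be set before the
                           -- first use of temporary tables in the session

CREATE TEMP TABLE temp_data (
    id int,
    column1 int,
    column2 text
) ON COMMIT DROP;          -- no separate DROP needed

COPY temp_data FROM '/path/to/csv' WITH (FORMAT csv, HEADER true);

-- Created in the same transaction, so it can skip WAL with wal_level = minimal:
CREATE TABLE target_table AS
SELECT column1 * 2 AS column1, substring(column2, 1, 2) AS column2
FROM temp_data;

COMMIT;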
About temp_buffers: set it high enough, before the first use of temporary tables in the session, and the temp table can live entirely in RAM. If you can connect to the input file with a foreign data wrapper directly (like Zegarek commented), you can skip the temp table. But that’s often not easily possible, and it doesn’t buy nearly as much as skipping WAL, as long as the temp table does live in RAM exclusively; writes to persistent storage are typically much more expensive.
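A sketch of that route using the file_fdw extension that ships with PostgreSQL; the server and foreign table names are made up for illustration, and reading server-side files this way requires appropriate privileges:

CREATE EXTENSION IF NOT EXISTS file_fdw;

CREATE SERVER csv_source FOREIGN DATA WRAPPER file_fdw;

-- Maps the CSV file directly; no COPY into a temp table needed.
CREATE FOREIGN TABLE csv_data (
    id int,
    column1 int,
    column2 text
) SERVER csv_source
OPTIONS (filename '/path/to/csv', format 'csv', header 'true');

BEGIN;
CREATE TABLE target_table AS
SELECT column1 * 2 AS column1, substring(column2, 1, 2) AS column2
FROM csv_data;
COMMIT;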