
I need to insert about 100 million rows of flat data into a Postgres database on a 24-hour schedule, i.e. every 24 hours an ingestor loads data from CSV files into the database.

I don’t know in advance which rows in the data will have changed on each load.
The data in this database is surfaced/published through an API endpoint, which should have no downtime even while the data is being reloaded.

Implementation:
My current approach does the following:

I have created two indexes: one on an int column and a varchar column, and the other on two varchar columns. These indexes are needed to improve the "time to first byte" of the API endpoints.
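Roughly, the index definitions look like this (the table and column names here are hypothetical; the real columns are an int + varchar pair and a varchar + varchar pair as described above):

    -- Hypothetical table and column names
    CREATE INDEX ON api_data (id, code);     -- int + varchar
    CREATE INDEX ON api_data (code, label);  -- varchar + varchar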

Then:

–> Start a transaction

–> Delete all rows from the table

–> Insert all 100 million rows using the Postgres CSV COPY command

–> Commit transaction.

I am doing the operations in a single transaction because the API is expected to remain available regardless of whether the data is being reloaded.
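In SQL, the current load is essentially the following (a sketch; the table and file names are assumed):

    -- Sketch of the current nightly load (assumed names)
    BEGIN;
    DELETE FROM api_data;
    COPY api_data FROM '/path/to/data.csv' WITH (FORMAT csv);
    COMMIT;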

The issue with this approach is that the first load (empty database, empty indexes) is relatively fast (about 2 hours), but subsequent loads take an average of 36 hours.

Can anyone please suggest what the bottleneck might be and maybe another approach to solve this problem?

I suspect the indexes are getting in the way of subsequent inserts.

2 Answers


  1. You should definitely TRUNCATE, not DELETE, if you want to empty the table. DELETE just marks the rows as invalid; the dead tuples still bloat the table and its indexes until VACUUM cleans them up.

    For further massive speed gains, you’d have to set wal_level = minimal. Then a COPY into a table that was truncated in the same transaction can avoid writing WAL. This may not be feasible, however: it requires a restart and will break replication and archive recovery.
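    A minimal sketch of that combination, assuming the same table and file names as above (note that wal_level = minimal also requires max_wal_senders = 0):

        -- postgresql.conf (needs a restart; breaks replication/archive recovery)
        --   wal_level = minimal
        --   max_wal_senders = 0

        -- With those settings, a COPY into a table truncated in the same
        -- transaction can skip writing WAL entirely.
        BEGIN;
        TRUNCATE api_data;
        COPY api_data FROM '/path/to/data.csv' WITH (FORMAT csv);
        COMMIT;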

  2. I suspect the indexes are getting in the way of subsequent inserts.

    Sure, that is plausible. There is no way we can tell from the info you have given us. Once the indexes are too large to stay in cache, maintaining them will suck. If the indexes started out fitting in cache with only 100e6 entries, but are too large at 200e6 entries (DELETE leaves the old index entries in place until they are vacuumed), that would easily explain it.

    One would usually do this with a/b tables. The API reads from table_a while table_b is being built and indexed, then you throw a switch so it starts using table_b, and table_a is the one that gets rebuilt and reindexed on the next load. You could use partitions so that the switch is thrown internally to PostgreSQL, just by detaching one partition and attaching a different one while the user-visible table name stays the same.

    This works because you can drop the indexes on the incipient table and recreate them once populated. This is much more efficient than maintaining them during population.
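    As a sketch (all names, columns, and the partition key here are hypothetical), the partition variant could look like this:

        -- One-time setup: list-partition the user-visible table on a constant
        -- "generation" column so its name never changes for the API.
        CREATE TABLE api_data (
            gen   int     NOT NULL DEFAULT 1,
            id    int     NOT NULL,
            code  varchar NOT NULL,
            label varchar NOT NULL
        ) PARTITION BY LIST (gen);

        CREATE TABLE api_data_a (LIKE api_data INCLUDING DEFAULTS, CHECK (gen = 1));
        CREATE TABLE api_data_b (LIKE api_data INCLUDING DEFAULTS, CHECK (gen = 1));
        ALTER TABLE api_data ATTACH PARTITION api_data_a FOR VALUES IN (1);

        -- Nightly: rebuild the detached table without indexes, index afterwards.
        TRUNCATE api_data_b;
        COPY api_data_b (id, code, label) FROM '/path/to/data.csv' WITH (FORMAT csv);
        CREATE INDEX ON api_data_b (id, code);
        CREATE INDEX ON api_data_b (code, label);
        ANALYZE api_data_b;

        -- Swap: a brief lock while the partitions are exchanged; the CHECK
        -- constraint lets ATTACH skip the validation scan.
        BEGIN;
        ALTER TABLE api_data DETACH PARTITION api_data_a;
        ALTER TABLE api_data ATTACH PARTITION api_data_b FOR VALUES IN (1);
        COMMIT;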
