
I need to insert about 100 million rows of flat data into a Postgres database on a 24-hour schedule, i.e. every 24 hours an ingestor loads data from CSV files into the database.

I don’t know in advance which rows in the data will have changed on each load.
The data in this database is surfaced/published through an API endpoint, which should have no downtime even while the data is being reloaded.

Implementation:
My current approach does the following:

I have created two indexes: one on an int column and a varchar column, and the other on two varchar columns. These indexes are needed to improve the "time to first byte" of the API endpoints.
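Roughly, the index definitions look like this (the table and column names here are hypothetical; the real columns are an int + varchar pair and a varchar + varchar pair as described above):

    -- Hypothetical table and column names
    CREATE INDEX ON api_data (id, code);     -- int + varchar
    CREATE INDEX ON api_data (code, label);  -- varchar + varchar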

Then:

–> Start a transaction

–> Delete all rows from the table

–> Insert all 100 million rows using the Postgres CSV COPY command

–> Commit transaction.

I am doing the operations in a single transaction because the API is expected to remain available regardless of whether the data is being reloaded.
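In SQL, the current load is essentially the following (a sketch; the table and file names are assumed):

    -- Sketch of the current nightly load (assumed names)
    BEGIN;
    DELETE FROM api_data;
    COPY api_data FROM '/path/to/data.csv' WITH (FORMAT csv);
    COMMIT;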

The issue with this approach is that the first load (empty database, empty indexes) is relatively fast (about 2 hours), but subsequent loads take an average of 36 hours.

Can anyone please suggest what the bottleneck might be and maybe another approach to solve this problem?

I suspect the indexes are getting in the way of subsequent inserts.

2 Answers


  1. You should definitely TRUNCATE, not DELETE, if you want to empty the table. DELETE just marks the rows as invalid; the dead tuples still bloat the table and its indexes until VACUUM cleans them up.

    For further massive speed gains, you’d have to set wal_level = minimal. Then a COPY into a table that was truncated in the same transaction can avoid writing WAL. This may not be feasible, however: it requires a restart and will break replication and archive recovery.
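    A minimal sketch of that combination, assuming the same table and file names as above (note that wal_level = minimal also requires max_wal_senders = 0):

        -- postgresql.conf (needs a restart; breaks replication/archive recovery)
        --   wal_level = minimal
        --   max_wal_senders = 0

        -- With those settings, a COPY into a table truncated in the same
        -- transaction can skip writing WAL entirely.
        BEGIN;
        TRUNCATE api_data;
        COPY api_data FROM '/path/to/data.csv' WITH (FORMAT csv);
        COMMIT;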

  2. I suspect the indexes are getting in the way of subsequent inserts.

    Sure, that is plausible. There is no way we can tell from the info you have given us. Once the indexes are too large to stay in cache, maintaining them will suck. If the indexes started out fitting in cache with only 100e6 entries, but are too large at 200e6 entries (DELETE leaves the old index entries in place until they are vacuumed), that would easily explain it.

    One would usually do this with a/b tables. The API reads from table_a while table_b is being built and indexed, then you throw a switch so it starts using table_b, and table_a is the one that gets rebuilt and reindexed on the next load. You could use partitions so that the switch is thrown internally to PostgreSQL, just by detaching one partition and attaching a different one while the user-visible table name stays the same.

    This works because you can drop the indexes on the incipient table and recreate them once populated. This is much more efficient than maintaining them during population.
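    As a sketch (all names, columns, and the partition key here are hypothetical), the partition variant could look like this:

        -- One-time setup: list-partition the user-visible table on a constant
        -- "generation" column so its name never changes for the API.
        CREATE TABLE api_data (
            gen   int     NOT NULL DEFAULT 1,
            id    int     NOT NULL,
            code  varchar NOT NULL,
            label varchar NOT NULL
        ) PARTITION BY LIST (gen);

        CREATE TABLE api_data_a (LIKE api_data INCLUDING DEFAULTS, CHECK (gen = 1));
        CREATE TABLE api_data_b (LIKE api_data INCLUDING DEFAULTS, CHECK (gen = 1));
        ALTER TABLE api_data ATTACH PARTITION api_data_a FOR VALUES IN (1);

        -- Nightly: rebuild the detached table without indexes, index afterwards.
        TRUNCATE api_data_b;
        COPY api_data_b (id, code, label) FROM '/path/to/data.csv' WITH (FORMAT csv);
        CREATE INDEX ON api_data_b (id, code);
        CREATE INDEX ON api_data_b (code, label);
        ANALYZE api_data_b;

        -- Swap: a brief lock while the partitions are exchanged; the CHECK
        -- constraint lets ATTACH skip the validation scan.
        BEGIN;
        ALTER TABLE api_data DETACH PARTITION api_data_a;
        ALTER TABLE api_data ATTACH PARTITION api_data_b FOR VALUES IN (1);
        COMMIT;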
