I have about 1,300 CSV files with almost 40k rows in each file. I have written a Python script that reads each file and converts all 40k entries into a single bulk INSERT statement for a Postgres database.
The pseudocode is as follows:
for file in tqdm(files, desc="Processing files"):
    rows = ReadFile(file)       # read all 40k rows from the file
    q = GenerateQuery(rows)     # convert all rows into a single bulk INSERT statement
    InsertData(q)               # execute the generated query to insert the data
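For context, a concrete version of that loop might look roughly like the sketch below; psycopg2, the connection string, and the table/column names are my assumptions, not the actual code:

import csv
import glob
import psycopg2
from psycopg2.extras import execute_values
from tqdm import tqdm

files = sorted(glob.glob("data/*.csv"))    # all 1300 CSV files
conn = psycopg2.connect("dbname=mydb")     # placeholder connection string
cur = conn.cursor()

for file in tqdm(files, desc="Processing files"):
    with open(file, newline="") as f:
        rows = list(csv.reader(f))         # all ~40k rows of the file
    # one multi-row INSERT per file; page_size=len(rows) keeps it a single statement
    execute_values(
        cur,
        "INSERT INTO my_table (col_a, col_b, col_c) VALUES %s",  # hypothetical table/columns
        rows,
        page_size=len(rows),
    )
    conn.commit()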
The code works, but there are performance issues. When I start with an empty table it runs at around 2 to 3 it/s, but after 15 to 20 files it slows to roughly 10 to 12 s/it, and the per-iteration time keeps growing with every 10 to 15 files processed until it reaches 40 to 50 s/it, which is absurd. Based on my understanding, I have developed the following hypothesis:
Since the table is empty at the start, updating its indexes is cheap, so a 40k-row bulk insert spends almost no time on index maintenance; but as the table grows, updating the indexes becomes harder and harder, especially once the table holds 10M+ rows.
My question: can I temporarily disable index updates on the table, so that after the full data load I can rebuild the indexes manually with some Postgres command (if such a thing exists)?
2 Answers
No, you cannot disable indexes. If the table is empty or almost empty when you start loading, you can win by dropping the indexes before you start and creating them again afterwards.

However, INSERT performance should stay constant over time. It is difficult to guess what might be the cause. You can try using auto_explain to capture EXPLAIN (ANALYZE, BUFFERS) output for a slow insert; that may give you a clue.

For bulk loads, you will get the best performance with COPY.
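As a rough sketch of both suggestions (the connection string, file name, and table/column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# log EXPLAIN (ANALYZE, BUFFERS) plans for statements run in this session
# (LOAD normally requires superuser; session_preload_libraries is an alternative)
cur.execute("LOAD 'auto_explain'")
cur.execute("SET auto_explain.log_min_duration = 0")
cur.execute("SET auto_explain.log_analyze = on")
cur.execute("SET auto_explain.log_buffers = on")

# load one CSV file with COPY instead of a generated INSERT statement
with open("some_file.csv") as f:
    cur.copy_expert(
        "COPY my_table (col_a, col_b, col_c) FROM STDIN WITH (FORMAT csv)",
        f,
    )
conn.commit()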
No, PostgreSQL does not have such a feature (though GIN indexes do have fastupdate, which is somewhat similar). But since the table starts out empty, simply not having the indexes at first and creating them after the load gives the identical outcome (and this is what pg_dump arranges to happen).
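A minimal sketch of that drop-then-recreate pattern, with hypothetical index and table names:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# drop the secondary indexes before the bulk load (names are placeholders)
cur.execute("DROP INDEX IF EXISTS my_table_col_a_idx")
cur.execute("DROP INDEX IF EXISTS my_table_col_b_idx")
conn.commit()

# ... run the per-file loading loop from the question here ...

# rebuild the indexes once, after all the data is in place
cur.execute("CREATE INDEX my_table_col_a_idx ON my_table (col_a)")
cur.execute("CREATE INDEX my_table_col_b_idx ON my_table (col_b)")
conn.commit()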
Maintaining indexes generally involves jumping to arbitrary leaf pages and dirtying those pages. Once the combined size of all the indexes no longer fits in shared_buffers (and certainly once they no longer fit in all of RAM), this is going to involve a lot of random IO, leading to a performance collapse.
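One way to check whether that is what is happening here is to compare the combined index size against shared_buffers; the table name below is a placeholder:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

cur.execute("SELECT pg_size_pretty(pg_indexes_size('my_table'))")  # total size of all indexes on the table
print("total index size:", cur.fetchone()[0])

cur.execute("SHOW shared_buffers")                                 # configured buffer cache size
print("shared_buffers:", cur.fetchone()[0])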