I am inserting data from the tmp_details table into the details table with an INSERT query, in chunks, and I am also modifying the data while inserting it. The query takes a long time to execute. My read IOPS is around 2,500 and my write IOPS is near 100; I/O utilization is at 100%, CPU utilization is below 10%, and RAM utilization is below 40%. I am using ctid to insert in chunks of 2,000,000 rows. What can I do to improve the performance of my queries and reduce the I/O utilization? I want CPU utilization to go above 80%.
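Roughly, the ctid chunking pattern I am using looks like this (a simplified sketch: the table and column names here are stand-ins for my real schema, and the 250,000-block step is only an example; on PostgreSQL 14+ such a range predicate can be served by a TID Range Scan):
-- Find how many 8 kB blocks the source table has, to derive chunk boundaries.
SELECT pg_relation_size('tmp_details') / current_setting('block_size')::int AS total_blocks;

-- One chunk: all rows whose ctid falls within a given block range.
INSERT INTO details (uniquekey, payload)
SELECT t.uniquekey, t.payload
FROM tmp_details t
WHERE t.ctid >= '(0,0)'::tid
  AND t.ctid <  '(250000,0)'::tid
ON CONFLICT (uniquekey) DO UPDATE
SET payload = EXCLUDED.payload;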
Server Specifications:
- PostgreSQL version: 15.6
- RAM: 32 GB
- Cores: 16
- Disk Space: SSD 500 GB
- OS: Linux Ubuntu 22.04
PostgreSQL Configuration:
max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
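For reference, the values actually in effect can be cross-checked at run time; a minimal sketch using the standard pg_settings view (the names simply mirror the settings listed above):
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'effective_io_concurrency', 'max_wal_size',
               'max_parallel_workers_per_gather');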
Table Details:
Table Name | Row Count | Size |
---|---|---|
source_cbsupi_tmp_details | 60 Million | 30 GB |
source_npciupi_tmp_details | 60 Million | 30 GB |
source_switchupi_tmp_details | 60 Million | 30 GB |
There are indexes on the columns uniquekey, key_priority_radcs, key_priority_rtdps, is_processed, and key_priority_ratrs. I had to use the DISTINCT ON clause because the JOIN was producing duplicate rows. I tried inserting in chunks of 1,000,000 rows using ctid, but it still takes a long time to execute, most likely because tables C and D have to be scanned in full on every iteration. So I inserted all 60 million rows in one statement and committed at the end. My aim is to run similar insert queries for tables C and D in parallel from the backend app server, but that will be pointless if my I/O utilization is already at 100%.
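For clarity, DISTINCT ON keeps exactly one row per uniquekey, but without an ORDER BY the surviving row is unpredictable; a deterministic version would look something like the sketch below (using recon_created_date as an assumed tie-breaker, which may or may not be the right column for my data):
SELECT DISTINCT ON (A.uniquekey) A.*
FROM cbsupi.source_cbsupi_tmp_details A
WHERE A.is_processed IS NULL
ORDER BY A.uniquekey, A.recon_created_date DESC;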
Insert Query:
EXPLAIN
INSERT INTO cbsupi.source_cbsupi_details (
    codglacct, refusrno, key_priority_radcs, recon_created_date, dattxnposting,
    status, uniquekey, coddrcr, cbsacqiss, codacctno, amttxnlcy, acnotrim,
    priority_no, rrn, recon_updated_date, recon_date_1_to_2, recon_date_1_to_3,
    reconciliation_date_time
)
SELECT DISTINCT ON (A.uniquekey)
    A.codglacct,
    A.refusrno,
    A.key_priority_radcs,
    A.recon_created_date,
    A.dattxnposting,
    A.status,
    A.uniquekey,
    A.coddrcr,
    A.cbsacqiss,
    A.codacctno,
    A.amttxnlcy,
    A.acnotrim,
    A.priority_no,
    A.rrn,
    '2025-01-07 19:50:41' AS recon_updated_date,
    CASE
        WHEN C.key_priority_rtdps IS NOT NULL THEN '2025-01-07 19:50:41'
        ELSE NULL
    END::TIMESTAMP AS recon_date_1_to_2,
    CASE
        WHEN D.key_priority_ratrs IS NOT NULL THEN '2025-01-07 19:50:41'
        ELSE NULL
    END::TIMESTAMP AS recon_date_1_to_3,
    CASE
        WHEN C.key_priority_rtdps IS NOT NULL
             AND D.key_priority_ratrs IS NOT NULL THEN '2025-01-07 19:50:41'
        ELSE NULL
    END::TIMESTAMP AS reconciliation_date_time
FROM cbsupi.source_cbsupi_tmp_details A
LEFT JOIN switchupi.source_switchupi_tmp_details C
    ON A.key_priority_radcs = C.key_priority_rtdps
LEFT JOIN npciupi.source_npciupi_tmp_details D
    ON A.key_priority_radcs = D.key_priority_ratrs
WHERE A.is_processed IS NULL
ON CONFLICT (uniquekey) DO UPDATE
SET
    recon_updated_date = EXCLUDED.recon_updated_date,
    recon_date_1_to_3 = EXCLUDED.recon_date_1_to_3,
    key_priority_radcs = EXCLUDED.key_priority_radcs,
    status = EXCLUDED.status,
    reconciliation_date_time = EXCLUDED.reconciliation_date_time,
    codacctno = EXCLUDED.codacctno,
    amttxnlcy = EXCLUDED.amttxnlcy,
    recon_date_1_to_2 = EXCLUDED.recon_date_1_to_2,
    rrn = EXCLUDED.rrn,
    codglacct = EXCLUDED.codglacct,
    refusrno = EXCLUDED.refusrno,
    dattxnposting = EXCLUDED.dattxnposting,
    coddrcr = EXCLUDED.coddrcr,
    cbsacqiss = EXCLUDED.cbsacqiss,
    acnotrim = EXCLUDED.acnotrim,
    priority_no = EXCLUDED.priority_no;
Explain Results
"QUERY PLAN"
Insert on source_cbsupi_details (cost=72270111.44..73213761.44 rows=0 width=0)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: source_cbsupi_details_pkey
" -> Subquery Scan on ""*SELECT*"" (cost=72270111.44..73213761.44 rows=62910000 width=811)"
-> Unique (cost=72270111.44..72584661.44 rows=62910000 width=823)
-> Sort (cost=72270111.44..72427386.44 rows=62910000 width=823)
Sort Key: a.uniquekey
-> Hash Left Join (cost=10739152.00..50771187.50 rows=62910000 width=823)
Hash Cond: (a.key_priority_radcs = d.key_priority_ratrs)
-> Hash Left Join (cost=5337191.00..25537830.00 rows=62910000 width=800)
Hash Cond: (a.key_priority_radcs = c.key_priority_rtdps)
-> Seq Scan on source_cbsupi_tmp_details a (cost=0.00..2092124.00 rows=62910000 width=767)
Filter: (is_processed IS NULL)
-> Hash (cost=4118441.00..4118441.00 rows=60000000 width=33)
-> Seq Scan on source_switchupi_tmp_details c (cost=0.00..4118441.00 rows=60000000 width=33)
-> Hash (cost=4124101.00..4124101.00 rows=62910000 width=33)
-> Seq Scan on source_npciupi_tmp_details d (cost=0.00..4124101.00 rows=62910000 width=33)
JIT:
Functions: 24
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
Questions:
- How can I improve the performance of the query and reduce I/O utilization?
- Is there a way to run these similar insert queries in parallel from the app server without hitting the I/O utilization limit?
- Will inserting the data in chunks benefit me, or is it better to insert everything at once? From what I observed, inserting in chunks takes more time than inserting the whole data set in one go.
Answers
A couple of things you can try, plus some questions about what you actually want; technically you are probably more knowledgeable about PostgreSQL than I am.
How active is the table? I once had an organizationally complex users table, and the solution was to split it into two tables: one for fast, small queries and a larger table holding the details. Shoveling data from tmp_details to details will always be the bulkiest solution. You seem to want a fast update, so a queue is not really what you want, which is a pity. I have used cron jobs and similar approaches.