
I am inserting data from the tmp_details table into the details table using an INSERT query in chunks, modifying the data as it is inserted. The query is taking a long time to execute. My read IOPS is hitting the 2500s while my write IOPS is near 100. My I/O utilization is at 100%, my CPU utilization is below 10%, and my RAM utilization is below 40%. I am using ctid to insert in chunks of 2,000,000 rows. What can I do to improve the performance of my queries and reduce the I/O utilization? I want to increase CPU utilization to more than 80%.

Server Specifications:

  • PostgreSQL version: 15.6
  • RAM: 32 GB
  • Cores: 16
  • Disk Space: SSD 500 GB
  • OS: Linux Ubuntu 22.04

PostgreSQL Configuration:

max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4

Table Details:

Table Name                      Row Count     Size
source_cbsupi_tmp_details       60 million    30 GB
source_npciupi_tmp_details      60 million    30 GB
source_switchupi_tmp_details    60 million    30 GB

There are indexes on the columns uniquekey, key_priority_radcs, key_priority_rtdps, is_processed, and key_priority_ratrs. I had to use the DISTINCT ON clause because I was getting duplicate rows from the JOINs. I tried inserting in chunks of 1,000,000 rows using ctid, but it still takes a long time to execute, most likely because the whole of tables C and D has to be scanned on every iteration. So instead I inserted all 60 million rows in one statement and committed at the end. My aim is to run similar insert queries for tables C and D in parallel from the backend app server, but that will be pointless if my I/O utilization is already at 100%.
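
For reference, a single chunk of the ctid-based approach described above would look roughly like the sketch below. The block boundaries and the shortened column list are illustrative only, and the joins against C and D from the full query (which get repeated inside every chunk) are omitted for brevity.

-- Illustrative sketch of one ctid-range chunk; the block numbers are made up
-- and the column list is shortened. PostgreSQL 14+ can satisfy the ctid
-- predicate with a TID Range Scan, so only the blocks in this range of A are read.
INSERT INTO cbsupi.source_cbsupi_details (uniquekey, key_priority_radcs)
SELECT DISTINCT ON (A.uniquekey)
       A.uniquekey,
       A.key_priority_radcs
FROM cbsupi.source_cbsupi_tmp_details A
WHERE A.is_processed IS NULL
  AND A.ctid >= '(0,1)'::tid          -- first tuple of this chunk
  AND A.ctid <  '(250000,1)'::tid     -- first tuple of the next chunk
ON CONFLICT (uniquekey) DO UPDATE
SET key_priority_radcs = EXCLUDED.key_priority_radcs;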

Insert Query:

EXPLAIN
INSERT INTO cbsupi.source_cbsupi_details (codglacct,
    refusrno,
    key_priority_radcs,
    recon_created_date,
    dattxnposting,
    status,
    uniquekey,
    coddrcr,
    cbsacqiss,
    codacctno,
    amttxnlcy,
    acnotrim,
    priority_no,
    rrn,
    recon_updated_date,
    recon_date_1_to_2,
    recon_date_1_to_3,
    reconciliation_date_time ) (
    SELECT DISTINCT ON (A.uniquekey)
        A.codglacct,
        A.refusrno,
        A.key_priority_radcs,
        A.recon_created_date,
        A.dattxnposting,
        A.status,
        A.uniquekey,
        A.coddrcr,
        A.cbsacqiss,
        A.codacctno,
        A.amttxnlcy,
        A.acnotrim,
        A.priority_no,
        A.rrn,
        '2025-01-07 19:50:41' AS recon_updated_date,
        CASE
            WHEN C.key_priority_rtdps IS NOT NULL THEN '2025-01-07 19:50:41'
            ELSE NULL
        END::TIMESTAMP AS recon_date_1_to_2,
        CASE
            WHEN D.key_priority_ratrs IS NOT NULL THEN '2025-01-07 19:50:41'
            ELSE NULL
        END::TIMESTAMP AS recon_date_1_to_3,
        CASE
            WHEN (C.key_priority_rtdps IS NOT NULL
                AND D.key_priority_ratrs IS NOT NULL) THEN '2025-01-07 19:50:41'
            ELSE NULL
        END::TIMESTAMP AS reconciliation_date_time
    FROM
        cbsupi.source_cbsupi_tmp_details A
    LEFT JOIN switchupi.source_switchupi_tmp_details C ON
        (A.key_priority_radcs = C.key_priority_rtdps)
    LEFT JOIN npciupi.source_npciupi_tmp_details D ON
        (A.key_priority_radcs = D.key_priority_ratrs)
    WHERE
        A.is_processed IS NULL )
ON CONFLICT (uniquekey) DO UPDATE
SET
    recon_updated_date = EXCLUDED.recon_updated_date,
    recon_date_1_to_3 = EXCLUDED.recon_date_1_to_3,
    key_priority_radcs = EXCLUDED.key_priority_radcs,
    status = EXCLUDED.status,
    reconciliation_date_time = EXCLUDED.reconciliation_date_time,
    codacctno = EXCLUDED.codacctno,
    amttxnlcy = EXCLUDED.amttxnlcy,
    recon_date_1_to_2 = EXCLUDED.recon_date_1_to_2,
    rrn = EXCLUDED.rrn,
    codglacct = EXCLUDED.codglacct,
    refusrno = EXCLUDED.refusrno,
    dattxnposting = EXCLUDED.dattxnposting,
    coddrcr = EXCLUDED.coddrcr,
    cbsacqiss = EXCLUDED.cbsacqiss,
    acnotrim = EXCLUDED.acnotrim,
    priority_no = EXCLUDED.priority_no;

Explain Results

"QUERY PLAN"
Insert on source_cbsupi_details  (cost=72270111.44..73213761.44 rows=0 width=0)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: source_cbsupi_details_pkey
"  ->  Subquery Scan on ""*SELECT*""  (cost=72270111.44..73213761.44 rows=62910000 width=811)"
        ->  Unique  (cost=72270111.44..72584661.44 rows=62910000 width=823)
              ->  Sort  (cost=72270111.44..72427386.44 rows=62910000 width=823)
                    Sort Key: a.uniquekey
                    ->  Hash Left Join  (cost=10739152.00..50771187.50 rows=62910000 width=823)
                          Hash Cond: (a.key_priority_radcs = d.key_priority_ratrs)
                          ->  Hash Left Join  (cost=5337191.00..25537830.00 rows=62910000 width=800)
                                Hash Cond: (a.key_priority_radcs = c.key_priority_rtdps)
                                ->  Seq Scan on source_cbsupi_tmp_details a  (cost=0.00..2092124.00 rows=62910000 width=767)
                                      Filter: (is_processed IS NULL)
                                ->  Hash  (cost=4118441.00..4118441.00 rows=60000000 width=33)
                                      ->  Seq Scan on source_switchupi_tmp_details c  (cost=0.00..4118441.00 rows=60000000 width=33)
                          ->  Hash  (cost=4124101.00..4124101.00 rows=62910000 width=33)
                                ->  Seq Scan on source_npciupi_tmp_details d  (cost=0.00..4124101.00 rows=62910000 width=33)
JIT:
  Functions: 24
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"

Questions:

  1. How can I improve the performance of the query and reduce the I/O utilization?
  2. Is there a way to run these similar insert queries in parallel from the app without hitting the I/O utilization limit?
  3. Will inserting the data in chunks benefit me, or is it better to insert it all at once? From what I observed, inserting in chunks takes more time than inserting everything at once.

2 Answers


  1. A couple of things you can try:

    1. Replace ON CONFLICT DO UPDATE with a separate UPDATE in a CTE, followed by the INSERT (see the first sketch below).
    2. IIRC Postgres isn't able to parallelise inserts, so if you rewrite your update/insert to sort consistently on the PK and to select non-conflicting batches (e.g. a modulo on the PK), you should be able to kick off the inserts in parallel from an application or script (see the second sketch below).
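
    A minimal sketch of the first idea, assuming no concurrent writers to source_cbsupi_details; the column list is heavily shortened, and the real SELECT with its joins would go inside the src CTE:

    -- Sketch only: UPDATE the existing keys in a CTE, then INSERT the rest,
    -- instead of INSERT ... ON CONFLICT DO UPDATE.
    WITH src AS (
        SELECT DISTINCT ON (A.uniquekey)
               A.uniquekey,
               A.key_priority_radcs,
               A.amttxnlcy
        FROM cbsupi.source_cbsupi_tmp_details A
        WHERE A.is_processed IS NULL
    ),
    updated AS (
        UPDATE cbsupi.source_cbsupi_details d
        SET key_priority_radcs = s.key_priority_radcs,
            amttxnlcy          = s.amttxnlcy
        FROM src s
        WHERE d.uniquekey = s.uniquekey
        RETURNING d.uniquekey
    )
    INSERT INTO cbsupi.source_cbsupi_details (uniquekey, key_priority_radcs, amttxnlcy)
    SELECT s.uniquekey, s.key_priority_radcs, s.amttxnlcy
    FROM src s
    WHERE NOT EXISTS (SELECT 1 FROM updated u WHERE u.uniquekey = s.uniquekey);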
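
    And a sketch of the second idea, again with a shortened column list. It assumes uniquekey is (or can be cast to) text; hashtext() is an internal PostgreSQL hashing function (undocumented but callable from SQL), and taking the low two bits of its result splits the rows into four disjoint batches, one per application worker, so the four statements never touch the same uniquekey and can run in parallel sessions:

    -- Sketch only: each worker runs the same statement with a different batch
    -- number (0..3); the hash on uniquekey keeps the batches disjoint.
    INSERT INTO cbsupi.source_cbsupi_details (uniquekey, key_priority_radcs)
    SELECT DISTINCT ON (A.uniquekey)
           A.uniquekey,
           A.key_priority_radcs
    FROM cbsupi.source_cbsupi_tmp_details A
    WHERE A.is_processed IS NULL
      AND (hashtext(A.uniquekey::text) & 3) = 0   -- this worker's batch; the others use 1, 2, 3
    ORDER BY A.uniquekey                          -- consistent ordering on the conflict key
    ON CONFLICT (uniquekey) DO UPDATE
    SET key_priority_radcs = EXCLUDED.key_priority_radcs;
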
  2. Some questions about what you actually want. Technically, you are probably more knowledgeable about PostgreSQL than I am.

    How active is the table? I once had an organisationally complex users table, and the solution was to split it into two tables: a narrow one for fast, small queries and a larger one with the details (a rough sketch follows below). Shoveling data from tmp_details to details will always be the bulkiest solution.
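
    A very rough sketch of that split, with made-up table and column names, purely to illustrate the idea:

    -- Hypothetical example only: a narrow "hot" table for the small, frequent
    -- queries, and a wide detail table that is joined in only when needed.
    CREATE TABLE upi_txn_hot (
        uniquekey          text PRIMARY KEY,
        status             text,
        amttxnlcy          numeric,
        recon_updated_date timestamp
    );

    CREATE TABLE upi_txn_detail (
        uniquekey text PRIMARY KEY REFERENCES upi_txn_hot (uniquekey),
        payload   jsonb   -- everything the fast queries never touch
    );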

    You seem to want fast updates, so a queue is really not what you want. A pity. I used cron jobs and more.
