I am reading through a large list of data and adding it to a PostgreSQL database. The problem is, sometimes there are duplicate values in the data I’m reading in, but sometimes they fill in data that was missing previously. To remedy this, I have added the following to my script, but it is very ugly:
INSERT INTO tab(id,col1,col2,col3,...) VALUES (i,v1,v2,v3,...)
ON CONFLICT (id)
DO UPDATE
SET
(col1,col2,col3, ...)=(
COALESCE(tab.col1, EXCLUDED.col1),
COALESCE(tab.col2, EXCLUDED.col2),
COALESCE(tab.col3, EXCLUDED.col3),
...
);
I’m hoping there’s a more elegant solution than manually writing out every column in the table. I have a few more tables I need to write these for so I’d prefer if there was a more generic way to do this.
EDIT: I’m pretty new to this, and reading the docs, this may be a dumb way to do this in the first place. Please let me know if I should even be using the INSERT
command for this, it looks like maybe just UPDATE
or some form of JOIN
could accomplish the same thing?
Postgres version:
psql (PostgreSQL) 12.14 (Ubuntu 12.14-0ubuntu0.20.04.1)
2
Answers
It’s necessary to list every column, but it’s not necessary to enter the list manually. The following query is an example of using column information found in information_schema to generate the SET clause:
This could easily be incorporated into a function and expanded to generate the entire insert query for each table.
Your query looks mostly good. As does John‘s meta-query to build it.
One major issue remains: don’t update rows that don’t actually change. That accrues the full update cost for no gain.
Better (but more verbose):
The first suggestion only suppresses updates where the input row is exactly the same as the existing row.
The second (better) suggestion suppresses all empty updates.
Related:
Superior approach
My above suggestions help to minimize the number of expensive updates.
If you can manage without adding too much overhead, don’t update the same row repeatedly at all. Consolidate multiple input rows into a single row before applying the
UPDATE
.Like:
You might first write to a
TEMPORARY
staging tablemy_input_rows
and take it from there. Or useVALUES
expression like in your initial code, just not directly attached to theINSERT
, so you may need explicit type casts. See:I chose
min()
as it ignoresnull
input. If you either havenull
or one distinct value per set, that does it.This can even be done for multiple target tables in a single SQL statement, employing multiple data-modifying CTEs.