
I am reading through a large list of data and adding it to a PostgreSQL database. The problem is that the incoming data sometimes contains duplicate rows, and those duplicates sometimes fill in values that were missing previously. To remedy this, I have added the following to my script, but it is very ugly:

INSERT INTO tab(id,col1,col2,col3,...) VALUES (i,v1,v2,v3,...)
ON CONFLICT (id)
    DO UPDATE 
        SET 
            (col1,col2,col3, ...)=(
                COALESCE(tab.col1, EXCLUDED.col1),
                COALESCE(tab.col2, EXCLUDED.col2),
                COALESCE(tab.col3, EXCLUDED.col3),
                ...
             );

I’m hoping there’s a more elegant solution than manually writing out every column in the table. I have a few more tables I need to write these for, so I’d prefer a more generic way to do this.

EDIT: I’m pretty new to this, and after reading the docs, this may be a dumb way to do this in the first place. Please let me know whether I should even be using the INSERT command for this; it looks like maybe a plain UPDATE or some form of JOIN could accomplish the same thing?

Postgres version:
psql (PostgreSQL) 12.14 (Ubuntu 12.14-0ubuntu0.20.04.1)

2 Answers


  1. It’s necessary to list every column, but it’s not necessary to enter the list manually. The following query is an example of using column information found in information_schema to generate the SET clause:

    WITH query_fragments AS (
        SELECT
            string_agg(quote_ident(c.column_name), ', ' ORDER BY c.ordinal_position) AS column_list,
            string_agg(format('COALESCE(tab.%I, excluded.%I)', c.column_name, c.column_name), ', ' ORDER BY c.ordinal_position) AS column_values
        FROM
            information_schema.columns c
        WHERE
            c.table_schema = 'public'
            AND c.table_name = 'tab'
            AND c.column_name <> 'id'
    )
    SELECT
        format('SET (%s) = (%s)', column_list, column_values) AS set_clause
    FROM
        query_fragments;
    

    This could easily be incorporated into a function and expanded to generate the entire insert query for each table.
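
    A rough sketch of such a generator function (the name build_upsert_sql, the single-key-column signature, and the $1, $2, ... value placeholders are my own assumptions, not part of the original query):

    CREATE OR REPLACE FUNCTION build_upsert_sql(p_schema text, p_table text, p_key text)
      RETURNS text
      LANGUAGE sql STABLE AS
    $func$
    SELECT format(
        'INSERT INTO %I.%I AS t (%I, %s)
         VALUES ($1, %s)
         ON CONFLICT (%I) DO UPDATE
         SET (%s) = (%s);'
      , p_schema, p_table
      , p_key
      , string_agg(quote_ident(col), ', ' ORDER BY rn)                               -- column list
      , string_agg('$' || (rn + 1), ', ' ORDER BY rn)                                -- $2, $3, ... value placeholders
      , p_key
      , string_agg(quote_ident(col), ', ' ORDER BY rn)
      , string_agg(format('COALESCE(t.%I, EXCLUDED.%I)', col, col), ', ' ORDER BY rn)
        )
    FROM  (
        SELECT c.column_name AS col
             , row_number() OVER (ORDER BY c.ordinal_position) AS rn
        FROM   information_schema.columns c
        WHERE  c.table_schema = p_schema
        AND    c.table_name   = p_table
        AND    c.column_name  <> p_key
        ) sub;
    $func$;

    SELECT build_upsert_sql('public', 'tab', 'id'); would then return the finished statement text, ready to be prepared by the client or executed with EXECUTE ... USING in PL/pgSQL.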

Your query looks mostly good, as does John’s meta-query to build it.

    One major issue remains: don’t update rows that don’t actually change. That accrues the full update cost for no gain.

    INSERT INTO tab AS t
           (id, col1, col2, col3)
    VALUES (i , v1  , v2  , v3  )
    ON     CONFLICT (id) DO UPDATE 
    SET   (col1, col2, col3) =
          (COALESCE(t.col1, EXCLUDED.col1),
           COALESCE(t.col2, EXCLUDED.col2),
           COALESCE(t.col3, EXCLUDED.col3))
    WHERE  EXCLUDED IS DISTINCT FROM t;

    Better (but more verbose):

    ...
    WHERE (col1, col2, col3) IS DISTINCT FROM 
          (COALESCE(t.col1, EXCLUDED.col1),
           COALESCE(t.col2, EXCLUDED.col2),
           COALESCE(t.col3, EXCLUDED.col3));     -- !!!
    

The first suggestion suppresses an update only when the input row is exactly the same as the existing row.
The second (better) suggestion suppresses all empty updates.


    Superior approach

    My above suggestions help to minimize the number of expensive updates.
    If you can manage without adding too much overhead, don’t update the same row repeatedly at all. Consolidate multiple input rows into a single row before applying the UPDATE.

    Like:

    INSERT INTO tab AS t
          (id, col1   , col2   , col3)
    SELECT i , min(v1), min(v2), min(v3)
    FROM   my_input_rows i
    GROUP  BY 1
    ON     CONFLICT (id) DO UPDATE 
    SET   (col1, col2, col3) =
          (COALESCE(t.col1, EXCLUDED.col1),
           COALESCE(t.col2, EXCLUDED.col2),
           COALESCE(t.col3, EXCLUDED.col3))
    WHERE (col1, col2, col3) IS DISTINCT FROM 
          (COALESCE(t.col1, EXCLUDED.col1),
           COALESCE(t.col2, EXCLUDED.col2),
           COALESCE(t.col3, EXCLUDED.col3));
    

You might first write to a TEMPORARY staging table my_input_rows and take it from there. Or use a VALUES expression like in your initial code, just not attached directly to the INSERT, in which case you may need explicit type casts.
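
    For instance, the VALUES list could go into a CTE instead of being attached to the INSERT directly; the two sample rows and the type casts below are purely illustrative:

    WITH my_input_rows (id, col1, col2, col3) AS (
        VALUES
           (1::int, 'a'::text, NULL::text, 'c'::text)   -- casts in the first row fix the column types
         , (1     , NULL     , 'b'       , 'c')
        )
    INSERT INTO tab AS t
          (id, col1, col2, col3)
    SELECT id, min(col1), min(col2), min(col3)
    FROM   my_input_rows
    GROUP  BY 1
    ON     CONFLICT (id) DO UPDATE
    SET   (col1, col2, col3) =
          (COALESCE(t.col1, EXCLUDED.col1),
           COALESCE(t.col2, EXCLUDED.col2),
           COALESCE(t.col3, EXCLUDED.col3))
    WHERE (t.col1, t.col2, t.col3) IS DISTINCT FROM
          (COALESCE(t.col1, EXCLUDED.col1),
           COALESCE(t.col2, EXCLUDED.col2),
           COALESCE(t.col3, EXCLUDED.col3));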

I chose min() because it ignores null input. If each group holds either nulls or a single distinct value per column, that does the job.

    This can even be done for multiple target tables in a single SQL statement, employing multiple data-modifying CTEs.
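
    A rough sketch of that pattern, with hypothetical target tables tab_a and tab_b standing in for real ones:

    WITH my_input_rows (id, a1, b1) AS (
        VALUES (1::int, 'x'::text, 'y'::text)
        )
    , upsert_a AS (
       INSERT INTO tab_a AS t (id, a1)
       SELECT id, min(a1) FROM my_input_rows GROUP BY 1
       ON CONFLICT (id) DO UPDATE
       SET    a1 = COALESCE(t.a1, EXCLUDED.a1)
       WHERE  t.a1 IS DISTINCT FROM COALESCE(t.a1, EXCLUDED.a1)
       )
    INSERT INTO tab_b AS t (id, b1)
    SELECT id, min(b1) FROM my_input_rows GROUP BY 1
    ON CONFLICT (id) DO UPDATE
    SET    b1 = COALESCE(t.b1, EXCLUDED.b1)
    WHERE  t.b1 IS DISTINCT FROM COALESCE(t.b1, EXCLUDED.b1);

    Each data-modifying CTE performs its own upsert against a different table while reading from the same set of input rows.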
