Postgresql - Assigning all the dependencies from a duplicate row to the original

MiguelRobledo
August 30, 2024
117 views
1 vote
2 Answers

I have a database with the following structure

table_a
| id       | username |
| -------- | -------- |
| 1        | 'a.a'    |
| 2        | 'a.a'    |

table_b

| id       | table_a_id |
| -------- | ---------- |
| 1        | 1          |
| 2        | 3          |

table_c

| id       | table_a_id |
| -------- | ---------- |
| 1        | 2          |
| 2        | 3          |

I want to automattically find the dependences of table_a in the database and change the ids of the second user to the one of the first user so that the end result would look like this:

table_a
| id       | username |
| -------- | -------- |
| 1        | 'a.a'    |
| 2        | 'a.a'    |

table_b

| id       | table_a_id |
| -------- | ---------- |
| 1        | 1          |
| 2        | 3          |

table_c

| id       | table_a_id |
| -------- | ---------- |
| 1        | 1          |
| 2        | 3          |

I can do it by listing the known tables, but I haven’t been able to automatically find all the dependencies in the database.

Tags: postgresql sql

Answers

- LajosArpad
- August 29, 2024 at 3:50 pm
- 0 votes
0
Something like this:
```
update table_b
set table_b.table_a_id = min_a.id
from table_a current
join table_a min_a
on current.username = min_a.username
left join table_a earlier
on min_a.username = earlier.username and earlier.id < min_a.id
where earlier.id is null and table_b.table_a_id = current.id and current.id > min_a.id
```
Explanation:
- we use an update-join
- we update table_b
- from table_a current that is the current match of table_b in table_a
- join with table_a min_a which represents the record in table_a having the minimum id with the same username
- left joined with table_a earlier to guarantee that min_a is indeed the minimum match
- so earlier.id is to be null
- and we avoid doing an update if the current match is already the minimum
Login or Signup to reply.

The PostgreSQL system catalogs can be used to identify referential dependencies and to generate code to delete redundant values from the parent table after updating the references in the child tables.

The following script creates and populates tables to demonstrate addressing redundant references:

CREATE TABLE table_a (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  username TEXT
);

CREATE TABLE table_b (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  table_a_id INTEGER REFERENCES table_a(id)
);

CREATE TABLE table_c (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  table_a_id INTEGER REFERENCES table_a(id)
);

INSERT INTO table_a(username)
VALUES ('a.a'), ('a.a'), ('a.b');

INSERT INTO table_b(table_a_id)
VALUES (1), (3);

INSERT INTO table_c(table_a_id)
VALUES (2), (3);

The resulting table contents are:

Table A:

id	username
1	a.a
2	a.a
3	a.b

Table B:

id	table_a_id
1	1
2	3

Table C:

id	table_a_id
1	2
2	3

The following DO block uses the system catalogs to generate a statement for each parent table. The generated statement employs chained CTEs to identify redundant parent rows, update the child table references to the redundant rows, and then delete the redundant rows. (This code addresses cases where each foreign key references a single column and none of the child tables has a unique index that will be violated when references are updated.)

DO LANGUAGE plpgsql
$BLOCK$
DECLARE
  rec record;
BEGIN
  FOR rec IN
    WITH
      cte AS (
        SELECT
          c.confrelid AS parent_table_oid,
          c.confkey[1] AS parent_column_num,
          -- generate table expressions to update the dependent child tables
          FORMAT(
            'u%1$s AS (UPDATE %2$I.%3$I n SET %4$I = r.kept_id FROM r WHERE n.%4$I = ANY(r.redundant_ids))',
            ROW_NUMBER() OVER (PARTITION BY c.confrelid ORDER BY child_tables.oid),
            child_ns.nspname,
            child_tables.relname,
            child_columns.attname
          ) AS update_cte
        FROM
          pg_constraint c
          JOIN pg_class child_tables ON child_tables.oid = c.conrelid
          JOIN pg_namespace child_ns ON child_ns.oid = child_tables.relnamespace
          JOIN pg_attribute child_columns ON child_columns.attrelid = c.conrelid
          AND child_columns.attnum = c.conkey[1]
        WHERE
          c.confrelid = 'table_a'::regclass
      )
    SELECT
      -- generate a table expression to identify redundant rows in the parent table
      FORMAT(
        'WITH r AS (SELECT MIN(%1$I) AS kept_id, (ARRAY_AGG(%1$I ORDER BY %1$I))[2:] AS redundant_ids FROM %2$I.%3$I GROUP BY username HAVING COUNT(*) > 1),',
        parent_columns.attname,
        parent_ns.nspname,
        parent_tables.relname
      ) ||
      -- concatenate associated child update table expressions
      STRING_AGG(cte.update_cte, ',') ||
      -- append the delete statement to remove redundant rows from the parent table
      FORMAT(
        'DELETE FROM %1$I.%2$I USING r WHERE %2$I.%3$I = ANY(r.redundant_ids)',
        parent_ns.nspname,
        parent_tables.relname,
        parent_columns.attname
      ) AS statement
    FROM
      cte
      JOIN pg_class parent_tables ON parent_tables.oid = cte.parent_table_oid
      JOIN pg_attribute parent_columns ON parent_columns.attrelid = cte.parent_table_oid
      AND parent_columns.attnum = cte.parent_column_num
      JOIN pg_namespace parent_ns ON parent_ns.oid = parent_tables.relnamespace
    GROUP BY
      parent_ns.nspname,
      parent_tables.relname,
      parent_columns.attname LOOP
  EXECUTE rec.statement;
  END loop;
END;
$BLOCK$;

Below is a formatted example of the generated SQL:

WITH
  r AS (
    SELECT MIN(id) AS kept_id, (ARRAY_AGG(id ORDER BY id))[2:] AS redundant_ids
      FROM public.table_a
      GROUP BY username
      HAVING COUNT(*) > 1
  ),
  u1 AS (
    UPDATE public.table_b n
      SET table_a_id = r.kept_id
      FROM r
      WHERE n.table_a_id = ANY (r.redundant_ids)
  ),
  u2 AS (
    UPDATE public.table_c n
      SET table_a_id = r.kept_id
      FROM r
      WHERE n.table_a_id = ANY (r.redundant_ids)
  )
DELETE FROM public.table_a
  USING r
  WHERE table_a.id = ANY (r.redundant_ids);

After running the script, the demonstration tables have the following content:

Table A:

id	username
1	a.a
3	a.b

Table B:

id	table_a_id
1	1
2	3

Table C:

id	table_a_id
1	1
2	3

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Assigning all the dependencies from a duplicate row to the original

Answers