I made a mistake when defining a unique key in one of my SQL tables, and this has resulted in duplicate rows of data being entered into the table.
The rows of data look like this:
A, B, C, D, E, created_datetime, last_seen_datetime
The unique key is defined across (A, B, C, D, E). Some of these columns are nullable, and I had forgotten that a unique key does not prevent duplicates when one or more of its columns contain NULL, because NULL values are not considered equal to each other.
What I want to do is perform two operations:
- Find the minimum created_datetime and maximum last_seen_datetime and set the existing duplicate rows to have these values
- Delete the existing duplicate rows
Then I want to:
- Redefine the unique key so that duplicates cannot be inserted
In order to do this I will:
- Delete the existing unique key
- Change the columns so that none are nullable
- This will require setting any currently null values to something which is non-null but represents a null value
- Re-create the unique key. Now that the columns are not nullable, it will in future enforce uniqueness
I am not sure if this is possible in pure SQL.
I haven’t made much progress towards solving this. I had an idea that perhaps a GROUP BY operation could be used to find duplicate rows.
```sql
select count(*), A, B, C, D, E
from my_table
group by A, B, C, D, E
having count(*) > 1;
```
This does indeed appear to be working but I am now unsure of how to use the found duplicate rows to perform the remaining work.
2 Answers
I think you are on the right track there, perhaps try this: (no promises)
Please do make a copy of your table somewhere before you try this – and don’t commit until you’ve checked it works, for my sanity please!
Step 1: Find the minimum created_datetime and maximum last_seen_datetime for each set of duplicate rows:
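A query along these lines should do it (I’m assuming your table is called my_table — swap in the real name):

```sql
-- Minimum created_datetime and maximum last_seen_datetime
-- for each group of duplicate rows
select A, B, C, D, E,
       min(created_datetime)   as min_created,
       max(last_seen_datetime) as max_last_seen
from my_table
group by A, B, C, D, E
having count(*) > 1;
```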
Step 2: Update the existing duplicate rows:
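Something like this, assuming MySQL. Note the NULL-safe comparison operator `<=>` in the join: a plain `=` would never match the rows where one of the key columns is NULL, which is exactly where your duplicates live.

```sql
update my_table t
join (
    select A, B, C, D, E,
           min(created_datetime)   as min_created,
           max(last_seen_datetime) as max_last_seen
    from my_table
    group by A, B, C, D, E
    having count(*) > 1
) d
  on  t.A <=> d.A and t.B <=> d.B and t.C <=> d.C
  and t.D <=> d.D and t.E <=> d.E
set t.created_datetime   = d.min_created,
    t.last_seen_datetime = d.max_last_seen;
```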
Step 3: Get rid of the duplicates:
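One way to do this (again MySQL syntax, and it assumes the table has an auto-increment primary key column `id` — if it doesn’t, add one first so the duplicate rows can be told apart):

```sql
-- Keep the row with the lowest id in each duplicate group,
-- delete the rest
delete t
from my_table t
join my_table keeper
  on  t.A <=> keeper.A and t.B <=> keeper.B and t.C <=> keeper.C
  and t.D <=> keeper.D and t.E <=> keeper.E
  and t.id > keeper.id;
```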
Step 4: Redefine the unique key for the table:
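Roughly this — the index name, column types, and NULL placeholder values below are only illustrative, so replace them with whatever fits your schema:

```sql
-- Drop the old key
alter table my_table drop index my_unique_key;

-- For each nullable column: backfill NULLs, then forbid them
update my_table set C = 'x' where C is null;
alter table my_table modify C varchar(255) not null;

-- Re-create the unique key
alter table my_table
    add constraint uq_abcde unique (A, B, C, D, E);
```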
Obviously I don’t know what your table is called, so don’t forget to change that when you try this.
There are a few options to work around this:
I would create a new table with the correct table definition without deleting the original table; say the original table is table_a, then I would create a new table, let’s call it table_a_dummy.
We can coalesce the NULL values to an "x" if the column is a string, or perhaps to "-1" if it is an integer — whatever values you think fit best.
Then I use the row_number function to pick the first row from each set of duplicated data, and insert that into table_a_dummy. Assuming the data types are strings:
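A sketch of what that could look like, assuming MySQL 8+ or another database with window functions (the "x" placeholder and the choice to keep the earliest created_datetime are just examples):

```sql
create table table_a_dummy like table_a;
-- then alter table_a_dummy so that columns A–E are NOT NULL

insert into table_a_dummy (A, B, C, D, E, created_datetime, last_seen_datetime)
select A, B, C, D, E, created_datetime, last_seen_datetime
from (
    select coalesce(A, 'x') as A,
           coalesce(B, 'x') as B,
           coalesce(C, 'x') as C,
           coalesce(D, 'x') as D,
           coalesce(E, 'x') as E,
           created_datetime,
           last_seen_datetime,
           -- number the rows within each duplicate group,
           -- earliest created_datetime first
           row_number() over (
               partition by coalesce(A, 'x'), coalesce(B, 'x'),
                            coalesce(C, 'x'), coalesce(D, 'x'),
                            coalesce(E, 'x')
               order by created_datetime
           ) as rn
    from table_a
) ranked
where rn = 1;   -- keep only the first row of each group
```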
Review table_a_dummy, which should contain a cleaner version of table_a, without risking removal of the original table. Once you have reviewed table_a_dummy and you feel it is correct, we can safely drop table_a and rename table_a_dummy to table_a.