Postgres find duplicates using multiple columns and delete one row based on column value - Postgresql

RKIDEV
December 30, 2022
236 views
3 votes
3 Answers

I have a table below with sample data.

|  id  | type| cycle|    datetimestamp    |  status1  |  status2  |
|:----:|:---:|:----:|:-------------------:|:---------:|:---------:|
| 27   | AN  |  123 | 2022-12-28 04:12:31 | Normal A  | Normal A  |
| 27   | AR  |  124 | 2022-12-28 04:12:31 | Normal A  | Normal A  |<--Delete
| 19   | AN  |  125 | 2022-12-28 05:24:30 | Normal A  | Normal A  |
| 19   | AR  |  126 | 2022-12-28 06:18:20 | Normal A  | Normal A  |
| 19   | AR  |  234 | 2022-12-28 07:22:20 | Normal A  | Normal A  |<--Delete
| 19   | AN  |  235 | 2022-12-28 07:22:20 | Normal A  | Normal A  |
| 20   | AR  |  236 | 2022-12-28 08:25:49 | Normal A  | Normal A  |<--Delete
| 20   | AN  |  237 | 2022-12-28 08:25:49 | Normal A  | Normal A  |
| 19   | AR  |  129 | 2022-12-28 09:08:19 | Normal A  | Normal A  |<--Delete
| 19   | AN  |  127 | 2022-12-28 09:08:19 | Normal A  | Normal A  |
| 19   | AR  |  238 | 2022-12-28 10:04:31 | Normal A  | Normal A  |<--Delete
| 19   | AN  |  230 | 2022-12-28 10:04:31 | Normal A  | Normal A  |
| 22   | AN  |  239 | 2022-12-28 11:04:58 | Normal A  | Normal A  |
| 22   | AR  |  256 | 2022-12-28 11:04:58 | Normal A  | Normal A  |<--Delete

I want to find duplicates using the columns (id, type, datetimestamp, status1, status2)
and, for each duplicate pair, delete the row with type = 'AR' (keeping the row with type = 'AN'). Expected output:

|  id  | type| cycle|    datetimestamp    |  status1  |  status2  |
|:----:|:---:|:----:|:-------------------:|:---------:|:---------:|
| 27   | AN  |  123 | 2022-12-28 04:12:31 | Normal A  | Normal A  |
| 19   | AN  |  125 | 2022-12-28 05:24:30 | Normal A  | Normal A  |
| 19   | AR  |  126 | 2022-12-28 06:18:20 | Normal A  | Normal A  |
| 19   | AN  |  235 | 2022-12-28 07:22:20 | Normal A  | Normal A  |
| 20   | AN  |  237 | 2022-12-28 08:25:49 | Normal A  | Normal A  |
| 19   | AN  |  127 | 2022-12-28 09:08:19 | Normal A  | Normal A  |
| 19   | AN  |  230 | 2022-12-28 10:04:31 | Normal A  | Normal A  |
| 22   | AN  |  239 | 2022-12-28 11:04:58 | Normal A  | Normal A  |

But my query returns the rows with type = 'AN', not the ones with type = 'AR'.

~~~pgsql
select * from test_data e
where exists
 ( select * from test_data e2
   where e.datetimestamp=e2.datetimestamp and e.id=e2.id
     and e.status1=e2.status1
     and e.status2=e2.status2
     and e.type='AN' and e2.type='AR')
order by e.datetimestamp asc;
~~~

Setup table queries:

~~~pgsql
CREATE TABLE test_data (
 id character varying(2) NOT NULL,
 type character varying(2),
 cycle integer,
 datetimestamp timestamp without time zone NOT NULL,
 status1 character varying(10),
 status2 character varying(10),
 PRIMARY KEY(id, cycle, datetimestamp)
);

INSERT INTO test_data VALUES
  (27, 'AN', 123, '2022-12-28 04:12:31', 'Normal A', 'Normal A')
, (27, 'AR', 124, '2022-12-28 04:12:31', 'Normal A', 'Normal A')
, (19, 'AN', 125, '2022-12-28 05:24:30', 'Normal A', 'Normal A')
, (19, 'AR', 126, '2022-12-28 06:18:20', 'Normal A', 'Normal A')
, (19, 'AR', 234, '2022-12-28 07:22:20', 'Normal A', 'Normal A')
, (19, 'AN', 235, '2022-12-28 07:22:20', 'Normal A', 'Normal A')
, (20, 'AR', 236, '2022-12-28 08:25:49', 'Normal A', 'Normal A')
, (20, 'AN', 237, '2022-12-28 08:25:49', 'Normal A', 'Normal A')
, (19, 'AR', 129, '2022-12-28 09:08:19', 'Normal A', 'Normal A')
, (19, 'AN', 127, '2022-12-28 09:08:19', 'Normal A', 'Normal A')
, (19, 'AR', 238, '2022-12-28 10:04:31', 'Normal A', 'Normal A')
, (19, 'AN', 230, '2022-12-28 10:04:31', 'Normal A', 'Normal A')
, (22, 'AN', 239, '2022-12-28 11:04:58', 'Normal A', 'Normal A')
, (22, 'AR', 256, '2022-12-28 11:04:58', 'Normal A', 'Normal A')
;
~~~

Answers

- MabuKloesen
- December 30, 2022 at 9:04 am
- 0 votes
You can list the rows that would be deleted with this query:
```
ALTER TABLE test_data ADD COLUMN test_data_id serial;

SELECT t.test_data_id
FROM test_data t
JOIN (
        -- one array of row ids per group of duplicates
        SELECT array_agg(test_data_id) AS arr
        FROM test_data
        GROUP BY id, datetimestamp, status1, status2
        HAVING COUNT(*) > 1
    ) AS dup ON t.test_data_id = ANY(dup.arr)
WHERE t.type = 'AR';
```
So the delete query will be:
```
DELETE FROM test_data
WHERE test_data_id IN (
        SELECT t.test_data_id
        FROM test_data t
        JOIN (
                -- one array of row ids per group of duplicates
                SELECT array_agg(test_data_id) AS arr
                FROM test_data
                GROUP BY id, datetimestamp, status1, status2
                HAVING COUNT(*) > 1
            ) AS dup ON t.test_data_id = ANY(dup.arr)
        WHERE t.type = 'AR'
);
```
A brief explanation: the idea is to add a new unique column and use it as a row key. With it, you can pick out the duplicate rows and remove the ones with type 'AR'.

P.S. If you are not allowed to add another column, you can use the ctid system column instead, but I don't recommend that approach.
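For reference, the ctid variant mentioned here could look roughly like the sketch below. It is not part of the original answer. It assumes the same `test_data` table and the same grouping columns as the queries above, and it uses `ctid` only within a single statement, because a row's `ctid` can change when the row is updated or the table is rewritten (for example by `VACUUM FULL`).

```sql
-- Sketch only: same logic as the delete above, but keyed on the
-- system column ctid instead of an added serial column.
DELETE FROM test_data t
USING (
        -- one array of physical row locations per group of duplicates
        SELECT array_agg(ctid) AS arr
        FROM test_data
        GROUP BY id, datetimestamp, status1, status2
        HAVING count(*) > 1
    ) AS dup
WHERE t.ctid = ANY (dup.arr)
  AND t.type = 'AR';
```

Like the serial-column version, this removes every 'AR' row in a duplicate group, whether or not the group also contains an 'AN' row.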

Hopefully, it will help you.

- Edouard
- December 30, 2022 at 12:28 pm
- 0 votes
Try this:
```
DELETE FROM test_data a
USING test_data b
WHERE a.id = b.id
  AND a.type = 'AR'  -- the row to delete
  AND b.type = 'AN'  -- the matching row to keep
  AND a.datetimestamp = b.datetimestamp
  AND a.status1 = b.status1
  AND a.status2 = b.status2;
test result in dbfiddle

- GMB
- December 31, 2022 at 1:36 am
- 0 votes
exists looks like a good approach. Starting from your query, I fixed the conditions on type (we want to delete the "AR" record when a matching "AN" record exists). I also rewrote the multiple equalities as a tuple equality, but that’s just syntactic sugar:
```
delete 
from test_data e1
where type = 'AR' and exists (
    select 1 
    from test_data e2
    where (e2.datetimestamp, e2.id, e2.status1, e2.status2, e2.type) 
        = (e1.datetimestamp, e1.id, e1.status1, e1.status2, 'AN')
)
```
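Before committing a delete like this, it can help to run it inside a transaction and inspect what it removes. The sketch below is an addition, not part of the original answer. It wraps the same statement in `BEGIN` and adds a `RETURNING` clause so the deleted rows are shown. With the sample data from the question, it should return the six 'AR' rows marked for deletion there (cycles 124, 234, 236, 129, 238 and 256).

```sql
BEGIN;

DELETE FROM test_data e1
WHERE e1.type = 'AR'
  AND EXISTS (
        SELECT 1
        FROM test_data e2
        WHERE (e2.datetimestamp, e2.id, e2.status1, e2.status2, e2.type)
            = (e1.datetimestamp, e1.id, e1.status1, e1.status2, 'AN')
  )
RETURNING e1.*;  -- shows exactly the rows this statement deleted

-- Inspect the output, then finish with one of:
-- COMMIT;    -- keep the deletion
-- ROLLBACK;  -- undo it
```

Note that the equality comparison does not treat two NULLs as equal, so rows whose status1 or status2 is NULL will not be matched as duplicates by this condition.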
