In the following data:
id_1 | id_2 | colA | colB | colC |
---|---|---|---|---|
1 | 2 | 2022-01-02 | scroll fast | 12 |
1 | 2 | 2022-01-02 | scroll fast | 12 |
1 | 3 | 2022-01-03 | scroll fast fast | 11 |
1 | 3 | 2022-01-03 | scroll fast fast | 11 |
1 | 3 | 2022-01-03 | scroll fast fast | 11 |
I would like to remove duplicates only when they occur in pair. The output table should have one of the rows from the first two rows and remaining rows should be left as is.
The output table would be
id_1 | id_2 | colA | colB | colC |
---|---|---|---|---|
1 | 2 | 2022-01-02 | scroll fast | 12 |
1 | 3 | 2022-01-03 | scroll fast fast | 11 |
1 | 3 | 2022-01-03 | scroll fast fast | 11 |
1 | 3 | 2022-01-03 | scroll fast fast | 11 |
I have large dataset, so I cannot filter based on Id_1 and id_2. I am looking for a generic solution.
Test data is here at Sqlfiddle.
2
Answers
We can use a combination of
COUNT()
andROW_NUMBER()
, both as analytic functions:The above logic returns rows appearing only once or appearing 3 times or more. In the case of pair duplicates, it arbitrarily returns one of them.
You can use a
MERGE
statement and count the size of each group using analytic functions and then delete the second row of a group when there are only 2 rows per group:Which, for the sample data:
Then after the
MERGE
the table contains:fiddle