I consider a row a duplicate when the combination of hash_key and load_date repeats. I can find such rows with this CTE:
with w_duplicated_rows as (
    select hash_key, load_date, count(*)
        --, row_number() over (partition by hash_key, load_date order by load_date desc) as row_num
    from my_table
    group by hash_key, load_date
    having count(*) > 1
)
This gives me the combinations which are being repeated in the table.
Now I want to delete the duplicates from the original my_table. I have another column called STATUS. While deleting, I want to order each group of duplicates by STATUS DESC and keep only the top row, removing the second one (the row with the lower status).
How can I achieve this?
2 Answers
I can suggest using the ROW_NUMBER() window function to assign a row number to each row within the duplicate groups, ordered by STATUS DESC. Then you can use a common table expression (CTE) to identify the rows with row numbers greater than 1 and delete them from the original table. Here’s an example query:
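The query itself isn't reproduced here, but a minimal sketch of that approach, assuming PostgreSQL (the system column ctid stands in for a unique row identifier; on another engine, substitute a primary key or other unique column), could look like this:

with ranked as (
    select ctid,
           row_number() over (
               partition by hash_key, load_date
               order by status desc
           ) as row_num
    from my_table
)
-- row_num = 1 is the highest-status row in each (hash_key, load_date) group and is kept;
-- every other row in the group is deleted
delete from my_table
where ctid in (select ctid from ranked where row_num > 1);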
If you’ve truly got no field to break ties, maybe ctid is the best it’ll get. It’s hard to tell whether you are using Postgres or Redshift; Redshift appears to have a similar concept with row_id, but read the documentation.
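The actual statement is in the fiddle below; as a rough sketch of the same strategy (again assuming PostgreSQL and ctid, not necessarily what the fiddle contains), one way is to select the preferred row per group with DISTINCT ON and delete everything else:

-- keep, for each (hash_key, load_date), the row with the highest status,
-- using ctid as the tie-breaker when statuses are equal
delete from my_table
where ctid not in (
    select distinct on (hash_key, load_date) ctid
    from my_table
    order by hash_key, load_date, status desc, ctid
);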
Rationale: the basic strategy is to preferentially select the duplicate with the highest status, and add the system's unique row identifier to break ties.
Fiddle: https://dbfiddle.uk/MDBsecyC