skip to Main Content

I consider a row duplicate when the combination of hash_key and load_date is repeating. I can find such rows using this CTE:

with w_duplicated_rows as (
    select hash_key, load_date, count(*),
    --row_number() over (partition by hash_key, load_date order by load_date desc) as row_num
    from my_table
    group by hash_key, load_date having count(*) > 1
)

This gives me the combinations which are being repeated in the table.

Now, I want to delete duplicates from the original my_table. I have another column called STATUS. While deleting duplicates, I want to delete the row after filtering on ORDER BY STATUS DESC. Means I want to keep the top row and remove the second one (where status is lower).

How can I achieve this?

2

Answers


  1. I can suggest using ROW_NUMBER() window function to assign a row number to each row within the duplicate groups, ordered by STATUS DESC. Then, you can use a common table expression (CTE) to identify the rows with row numbers greater than 1 and delete them from the original table.

    Here’s an example query:

    WITH ranked_rows AS (
        SELECT
            hash_key,
            load_date,
            STATUS,
            ROW_NUMBER() OVER (PARTITION BY hash_key, load_date ORDER BY STATUS DESC) AS row_num
        FROM my_table
        WHERE (hash_key, load_date) IN (
            SELECT hash_key, load_date
            FROM my_table
            GROUP BY hash_key, load_date
            HAVING COUNT(*) > 1
        )
    )
    DELETE FROM my_table
    WHERE (hash_key, load_date) IN (
        SELECT hash_key, load_date
        FROM ranked_rows
        WHERE row_num > 1
    );
    

    Make sure to test this query on a backup before
    applying it to the entire table.

    Login or Signup to reply.
  2. If you’ve truly got no field to break ties maybe ctid is the best it’ll get. Hard to tell if you are using postgres or redshift. Redshift appears to have a similar concept with row_id. But read the documentation.

    Rationale – Basic strategy to preferentially select the duplicate with the highest status, but add the system unique identifier to break ties.

    Fiddle: https://dbfiddle.uk/MDBsecyC

    create table some_test_data
      (
        hash_key varchar(255),
        load_date date,
        status integer
      );
    insert into some_test_data values
      ('abc', '2024-01-01', 0),
      ('abc', '2024-01-01', 1),
      ('abc', '2024-01-01', 1),
      ('cde', '2024-01-02', 1),
      ('efg', '2024-01-03', 0),
      ('efg', '2024-01-04', 0);
    
    with duplicates_to_delete as 
      (
        select *,
               ctid,
               row_number() over ( partition by
                 hash_key,
                 load_date
                 order by status desc,
                          ctid desc) as rn
          from some_test_data
      )
    delete from some_test_data del_target
      where exists
              (
                select 1
                  from duplicates_to_delete dtd
                 where del_target.hash_key = dtd.hash_key
                   and del_target.load_date = dtd.load_date
                   and del_target.ctid = dtd.ctid
                   and dtd.rn > 1
              );
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search