
I made a mistake when defining a unique key in one of my SQL tables, and this has resulted in duplicate rows of data being entered into the table.

The rows of data look like this:

A, B, C, D, E, created_datetime, last_seen_datetime

The unique key is defined across (A, B, C, D, E). Some of these columns are nullable, and I had forgotten that a unique key does not prevent duplicates when one or more of the key columns is NULL: NULLs compare as distinct from each other, so two rows sharing a NULL in a key column are both accepted.

What I want to do is perform two operations:

  • Find the minimum created_datetime and maximum last_seen_datetime and set the existing duplicate rows to have these values
  • Delete existing duplicate rows

Then I want to:

  • Redefine the unique key so that duplicates cannot be inserted

In order to do this I will:

  • Drop the existing unique key
  • Change the columns so that none are nullable
  • This requires first setting any currently NULL values to a sentinel that is non-null but represents “no value”
  • Re-create the unique key; now that the columns are not nullable, it will enforce uniqueness in future

I am not sure if this is possible in pure SQL.

I haven’t made much progress towards solving this. I had an idea that perhaps a GROUP BY operation could be used to find duplicate rows:

select
  count(*),
  A,
  B,
  C,
  D,
  E
from
  my_table
group by
  A, B, C, D, E
having
  count(*) > 1;

This does indeed appear to work, but I am now unsure how to use the duplicate groups it finds to perform the remaining work.
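For what it’s worth, the GROUP BY idea can be sanity-checked outside the real database. The sketch below rehearses it with Python’s built-in sqlite3 module on invented rows (both SQLite and MySQL treat NULLs as equal for GROUP BY purposes, so the NULL-keyed duplicates do show up):

```python
# Rehearse the duplicate-finding query on a throwaway in-memory database.
# Table and column names follow the question; the rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE my_table (
        A TEXT, B TEXT, C TEXT, D TEXT, E TEXT,
        created_datetime TEXT, last_seen_datetime TEXT
    )
""")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a1", "b1", None, "d1", "e1", "2024-01-01", "2024-01-02"),
        ("a1", "b1", None, "d1", "e1", "2024-01-03", "2024-01-05"),  # dup key
        ("a2", "b2", "c2", "d2", "e2", "2024-01-01", "2024-01-01"),  # unique
    ],
)

# GROUP BY treats NULLs as equal, so the NULL-keyed pair forms one group.
dupes = conn.execute("""
    SELECT COUNT(*), A, B, C, D, E
    FROM my_table
    GROUP BY A, B, C, D, E
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # → [(2, 'a1', 'b1', None, 'd1', 'e1')]
```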

2 Answers


  1. I think you are on the right track there; perhaps try this (no promises).
    Please do make a copy of your table somewhere before you try this, and don’t commit until you’ve checked it works, for my sanity please!

    Step 1: Find the minimum created_datetime and maximum last_seen_datetime for each set of duplicate rows:

    SELECT A, B, C, D, E, 
    MIN(created_datetime) AS min_created_datetime, 
    MAX(last_seen_datetime) AS max_last_seen_datetime
    FROM my_table
    GROUP BY A, B, C, D, E
    HAVING COUNT(*) > 1;
    

    Step 2: Update the existing duplicate rows:

    UPDATE my_table AS t
    JOIN (
        SELECT A, B, C, D, E,
            MIN(created_datetime) AS min_created_datetime,
            MAX(last_seen_datetime) AS max_last_seen_datetime
        FROM my_table
        GROUP BY A, B, C, D, E
        HAVING COUNT(*) > 1
    ) AS dup
        -- <=> is MySQL's null-safe equality; a plain = would never match
        -- rows whose key columns are NULL, which are exactly the problem rows
        ON t.A <=> dup.A AND t.B <=> dup.B AND t.C <=> dup.C
       AND t.D <=> dup.D AND t.E <=> dup.E
    SET t.created_datetime = dup.min_created_datetime,
        t.last_seen_datetime = dup.max_last_seen_datetime;
    

    Step 3: get rid of the duplicates:

    -- After step 2 every row in a duplicate group is identical in all seven
    -- columns, so keep one copy of each distinct row and swap the data back.
    -- (A DELETE ... NOT IN (...) would go wrong here: it would keep every
    -- copy of the now-identical duplicates, delete all the non-duplicate
    -- rows, and NULLs in the key columns make NOT IN unreliable anyway.)
    CREATE TABLE my_table_dedup AS SELECT DISTINCT * FROM my_table;
    DELETE FROM my_table;
    INSERT INTO my_table SELECT * FROM my_table_dedup;
    DROP TABLE my_table_dedup;
    

    Step 4: redefining a unique key for the table:

    -- In MySQL a unique key is an index, so drop it with DROP INDEX
    -- (MySQL 8.0.19+ also accepts DROP CONSTRAINT)
    ALTER TABLE my_table DROP INDEX your_unique_key_name;

    -- Backfill NULLs before forbidding them; the placeholders here are only
    -- examples - pick values that cannot collide with real data
    UPDATE my_table SET
        A = COALESCE(A, -1),
        B = COALESCE(B, 'n/a'),
        C = COALESCE(C, 'n/a'),
        D = COALESCE(D, -1),
        E = COALESCE(E, 'n/a');

    -- MySQL changes a column's definition with MODIFY COLUMN, not ALTER COLUMN
    ALTER TABLE my_table
        MODIFY COLUMN A INT NOT NULL,
        MODIFY COLUMN B VARCHAR(255) NOT NULL,
        MODIFY COLUMN C VARCHAR(255) NOT NULL,
        MODIFY COLUMN D INT NOT NULL,
        MODIFY COLUMN E VARCHAR(255) NOT NULL;

    ALTER TABLE my_table ADD CONSTRAINT your_unique_key_name UNIQUE (A, B, C, D, E);
    

    Obviously I don’t know what your table is called, so don’t forget to change that when you try this.
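    Before running any of this for real, the overall flow of steps 2 and 3 can be rehearsed on throwaway data. The sketch below does that with Python’s built-in sqlite3 module: SQLite has no UPDATE ... JOIN or <=>, so the update is rewritten as correlated subqueries using IS, SQLite’s null-safe comparison. The names come from the question; the rows are invented.

    ```python
    # Rehearse steps 2 and 3 on an in-memory SQLite database.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE my_table (
        A TEXT, B TEXT, C TEXT, D TEXT, E TEXT,
        created_datetime TEXT, last_seen_datetime TEXT)""")
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?, ?, ?, ?, ?, ?)",
        [("a1", "b1", None, "d1", "e1", "2024-01-01", "2024-01-02"),
         ("a1", "b1", None, "d1", "e1", "2024-01-03", "2024-01-05"),
         ("a2", "b2", "c2", "d2", "e2", "2024-01-01", "2024-01-01")],
    )

    # Step 2 equivalent: IS treats NULL IS NULL as true, mirroring MySQL's <=>.
    conn.execute("""
        UPDATE my_table SET
            created_datetime = (
                SELECT MIN(s.created_datetime) FROM my_table s
                WHERE s.A IS my_table.A AND s.B IS my_table.B
                  AND s.C IS my_table.C AND s.D IS my_table.D
                  AND s.E IS my_table.E),
            last_seen_datetime = (
                SELECT MAX(s.last_seen_datetime) FROM my_table s
                WHERE s.A IS my_table.A AND s.B IS my_table.B
                  AND s.C IS my_table.C AND s.D IS my_table.D
                  AND s.E IS my_table.E)
    """)

    # Step 3 equivalent: duplicates are now whole-row identical, so DISTINCT
    # keeps exactly one copy of each.
    conn.executescript("""
        CREATE TABLE my_table_dedup AS SELECT DISTINCT * FROM my_table;
        DELETE FROM my_table;
        INSERT INTO my_table SELECT * FROM my_table_dedup;
        DROP TABLE my_table_dedup;
    """)

    remaining = conn.execute("SELECT * FROM my_table ORDER BY A").fetchall()
    print(remaining)  # one a1 row with MIN created / MAX last_seen, one a2 row
    ```

    The surviving a1 row carries created_datetime 2024-01-01 and last_seen_datetime 2024-01-05, i.e. the MIN/MAX collapsed across the duplicate pair, while the non-duplicate a2 row is untouched.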

  2. There are a few options to work around this:

    I would create a new table with the correct table definition, without deleting the original table. Say the original table is table_a; I would then create a new table, let’s call it table_a_dummy:

    create table table_a_dummy as (
       ....
    )
    

    We can coalesce the NULL values to an "x" if the column is a string, or perhaps to a -1 if it is an integer: whatever value you think fits best.

    Then I use the row_number function to pick the first row of each set of duplicates and insert that into table_a_dummy.

    Assume the data types are string:

    with cte1 as (
      select
        COALESCE(A,'x') A,
        COALESCE(B,'x') B,
        COALESCE(C,'x') C,
        COALESCE(D,'x') D,
        COALESCE(E,'x') E,
        created_datetime,
        last_seen_datetime
      from
        table_a
    ),
    
    cte2 as (
      SELECT 
        *,
        row_number() over(partition by A,B,C,D,E order by created_datetime asc, last_seen_datetime desc) as rn
      FROM cte1 
    )
    
    INSERT INTO table_a_dummy
    (A, B, C, D, E, created_datetime, last_seen_datetime)
    SELECT
      A, B, C, D, E, created_datetime, last_seen_datetime
    FROM cte2
    WHERE rn = 1;
    

    Review table_a_dummy; it should contain a cleaner version of table_a, produced without risking the original table.

    Once you have reviewed table_a_dummy and you feel it’s correct, we can safely remove table_a:

    drop table `table_a`
    

    and rename table_a_dummy to table_a:

    ALTER TABLE `table_a_dummy`
    RENAME TO `table_a`;
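    The CTE pipeline above can also be rehearsed on throwaway data. The sketch below uses Python’s built-in sqlite3 module (window functions need SQLite 3.25+); the table names follow the answer, the rows and the NOT NULL/UNIQUE definition of table_a_dummy are invented for illustration.

    ```python
    # Rehearse the COALESCE + ROW_NUMBER dedup into a fresh table on SQLite.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table_a (
            A TEXT, B TEXT, C TEXT, D TEXT, E TEXT,
            created_datetime TEXT, last_seen_datetime TEXT);

        -- The replacement table: NOT NULL key columns plus the unique key,
        -- so duplicates can never be inserted again.
        CREATE TABLE table_a_dummy (
            A TEXT NOT NULL, B TEXT NOT NULL, C TEXT NOT NULL,
            D TEXT NOT NULL, E TEXT NOT NULL,
            created_datetime TEXT, last_seen_datetime TEXT,
            UNIQUE (A, B, C, D, E));
    """)
    conn.executemany(
        "INSERT INTO table_a VALUES (?, ?, ?, ?, ?, ?, ?)",
        [("a1", "b1", None, "d1", "e1", "2024-01-01", "2024-01-02"),
         ("a1", "b1", None, "d1", "e1", "2024-01-03", "2024-01-05"),
         ("a2", "b2", "c2", "d2", "e2", "2024-01-01", "2024-01-01")],
    )

    conn.execute("""
        WITH cte1 AS (
            SELECT COALESCE(A, 'x') A, COALESCE(B, 'x') B, COALESCE(C, 'x') C,
                   COALESCE(D, 'x') D, COALESCE(E, 'x') E,
                   created_datetime, last_seen_datetime
            FROM table_a
        ),
        cte2 AS (
            SELECT *, ROW_NUMBER() OVER (
                          PARTITION BY A, B, C, D, E
                          ORDER BY created_datetime ASC,
                                   last_seen_datetime DESC) AS rn
            FROM cte1
        )
        INSERT INTO table_a_dummy
            (A, B, C, D, E, created_datetime, last_seen_datetime)
        SELECT A, B, C, D, E, created_datetime, last_seen_datetime
        FROM cte2
        WHERE rn = 1
    """)

    result = conn.execute("SELECT * FROM table_a_dummy ORDER BY A").fetchall()
    print(result)
    ```

    One design note: ordering by created_datetime ASC keeps the earliest-created row of each group, along with that row’s own last_seen_datetime; unlike answer 1, this does not merge the group’s MIN created / MAX last_seen into one row.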
    