I made a mistake when defining a unique key in one of my SQL tables, and this has resulted in duplicate rows of data being entered into the table.
The rows of data look like this:
A, B, C, D, E, created_datetime, last_seen_datetime
The unique key is defined across (A, B, C, D, E). Some of these columns are nullable, and I had forgotten that a unique key does not prevent duplicates when one or more of its columns contain NULL, because NULL values are not considered equal to each other.
What I want to do is perform two operations:
- Find the minimum created_datetime and maximum last_seen_datetime and set the existing duplicate rows to have these values
- Delete the existing duplicate rows
Then I want to:
- Redefine the unique key so that duplicates cannot be inserted
In order to do this I will:
- Delete the existing unique key
- Change the columns so that none are nullable
- This will require setting any currently null values to something which is non-null but represents a null value
- Re-create the unique key. Now that the columns are not nullable, it will in future enforce uniqueness
I am not sure if this is possible in pure SQL.
I haven’t made much progress towards solving this. I had an idea that perhaps a GROUP BY operation could be used to find duplicate rows.
```sql
select count(*), A, B, C, D, E
from my_table
group by A, B, C, D, E
having count(*) > 1;
```
This does indeed appear to be working but I am now unsure of how to use the found duplicate rows to perform the remaining work.
2 Answers
I think you are on the right track there, perhaps try this: (no promises)
Please do make a copy of your table somewhere before you try this – and don’t commit until you’ve checked it works, for my sanity please!
Step 1: Find the minimum created_datetime and maximum last_seen_datetime for each set of duplicate rows:
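A query along these lines should do it (I’m assuming your table is called my_table — swap in the real name):

```sql
-- Minimum created_datetime and maximum last_seen_datetime
-- for each group of duplicate rows
select A, B, C, D, E,
       min(created_datetime)   as min_created,
       max(last_seen_datetime) as max_last_seen
from my_table
group by A, B, C, D, E
having count(*) > 1;
```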
Step 2: Update the existing duplicate rows:
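Something like this, assuming MySQL. Note the NULL-safe comparison operator `<=>` in the join: a plain `=` would never match the rows where one of the key columns is NULL, which is exactly where your duplicates live.

```sql
update my_table t
join (
    select A, B, C, D, E,
           min(created_datetime)   as min_created,
           max(last_seen_datetime) as max_last_seen
    from my_table
    group by A, B, C, D, E
    having count(*) > 1
) d
  on  t.A <=> d.A and t.B <=> d.B and t.C <=> d.C
  and t.D <=> d.D and t.E <=> d.E
set t.created_datetime   = d.min_created,
    t.last_seen_datetime = d.max_last_seen;
```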
Step 3: Get rid of the duplicates:
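One way to do this (again MySQL syntax, and it assumes the table has an auto-increment primary key column `id` — if it doesn’t, add one first so the duplicate rows can be told apart):

```sql
-- Keep the row with the lowest id in each duplicate group,
-- delete the rest
delete t
from my_table t
join my_table keeper
  on  t.A <=> keeper.A and t.B <=> keeper.B and t.C <=> keeper.C
  and t.D <=> keeper.D and t.E <=> keeper.E
  and t.id > keeper.id;
```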
Step 4: Redefine the unique key for the table:
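Roughly this — the index name, column types, and NULL placeholder values below are only illustrative, so replace them with whatever fits your schema:

```sql
-- Drop the old key
alter table my_table drop index my_unique_key;

-- For each nullable column: backfill NULLs, then forbid them
update my_table set C = 'x' where C is null;
alter table my_table modify C varchar(255) not null;

-- Re-create the unique key
alter table my_table
    add constraint uq_abcde unique (A, B, C, D, E);
```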
Obviously I don’t know what your table is called, so don’t forget to change that when you try this.
There are a few options to work around this:
I would create a new table with the correct table definition without deleting the original table; say the original table is table_a, then I would create a new table, let’s call it table_a_dummy.
We can coalesce the NULL values to an "x" if the column is a string, or perhaps to "-1" if it is an integer — whatever values you think fit best.
Then I use the row_number function to pick the first row from each set of duplicated data, and insert that into table_a_dummy. Assuming the data types are strings:
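A sketch of what that could look like, assuming MySQL 8+ or another database with window functions (the "x" placeholder and the choice to keep the earliest created_datetime are just examples):

```sql
create table table_a_dummy like table_a;
-- then alter table_a_dummy so that columns A–E are NOT NULL

insert into table_a_dummy (A, B, C, D, E, created_datetime, last_seen_datetime)
select A, B, C, D, E, created_datetime, last_seen_datetime
from (
    select coalesce(A, 'x') as A,
           coalesce(B, 'x') as B,
           coalesce(C, 'x') as C,
           coalesce(D, 'x') as D,
           coalesce(E, 'x') as E,
           created_datetime,
           last_seen_datetime,
           -- number the rows within each duplicate group,
           -- earliest created_datetime first
           row_number() over (
               partition by coalesce(A, 'x'), coalesce(B, 'x'),
                            coalesce(C, 'x'), coalesce(D, 'x'),
                            coalesce(E, 'x')
               order by created_datetime
           ) as rn
    from table_a
) ranked
where rn = 1;   -- keep only the first row of each group
```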
Review table_a_dummy, which should contain a cleaner version of table_a, without risking removal of the original table. Once you have reviewed table_a_dummy and you feel it is correct, we can safely drop table_a and rename table_a_dummy to table_a.