skip to Main Content

I have a table companies in which there are some duplicated rows I found with a simple GROUP BY name.

This companies table has a PRIMARY KEY id which is referenced in several other tables, under the name company_id, which I also found with a query on the INFORMATION_SCHEMA.

Now my requirement: Merge the two companies into one, meaning update all the referenced company_id to reference the first id of the two, then delete the second company

All the FOREIGN KEYS are already declared with ON UPDATE CASCADE clause but I’m not sure if there’s a smart way of accomplishing the final result without manually updating all the referenced FOREIGN KEYS

My intuition was to

  • create a new temporary field mark_to_delete and set this = 1 on all the duplicated companies I want to delete
  • use FOREIGN_KEY_CHECKS=0 to avoid errors while updating a company.ID to an existing one
  • Delete companies that are mark_to_delete=1
  • Turn back on FOREIGN_KEY_CHECKS=1

Clearly my biggest doubt is on how FOREIGN_KEY_CHECKS works, I’ve tried reading the docs and searching for my specific case, but I can’t understand if the ON UPDATE CASCADE will work or not.

Worst case scenario I think would be to list all the referenced company ID’s, run updates on the tables with a referenced ID setting it to the "healthy" company, and then delete the duplicated company.

2

Answers


  1. You don’t need to change FOREIGN_KEY_CHECKS and you don’t need to think about ON UPDATE CASCADE.

    What you call the "worst case scenario" is actually easier.

    1. Identify duplicate companies. Say companies with id 1, 2, 14, 39 are duplicates of each other.

    2. Pick one to use as the canonical row for that given company. Let’s say it’s 1.

    3. Update any references in child tables to the other duplicate company ids:

      UPDATE other_table SET company_id = 1
      WHERE company_id IN (2, 14, 39);
      

      This does not require changing FOREIGN_KEY_CHECKS, because the company_id references satisfy the constraint before and after the UPDATE.

      This does not incur ON UPDATE CASCADE. Cascading is for when you update the referenced primary key, but you aren’t doing that. You’re changing the foreign keys in child table(s).

    4. Repeat for any other tables that reference companies.

    5. Once the duplicate rows in companies have no references in child tables, you are free to delete them.

      DELETE FROM companies WHERE id IN (2, 14, 39);
      

    I recommend double-checking all the other columns of companies before choosing the canonical row. Any of them could have useful data (e.g. current phone number), and you’d have to read them and merge manually.

    Login or Signup to reply.
  2. UPDATE slave
      JOIN main m1 USING (main_id)
      JOIN ( SELECT MIN(main_id) main_id, main_name
             FROM main m2
             GROUP BY 2
             ) m3 USING (main_name)
    SET slave.main_id = m3.main_id;
    

    The duplicates in the sample are the rows in main table with the same main_name value. So the subquery selects minimal main_id for all duplicates, and all references are replaced with this value.

    Complete sample fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search