
I am trying to update one table based on another in the most efficient way.

Here is the table DDL of what I am trying to update

Table1

CREATE TABLE `customersPrimary` (
  `id` int NOT NULL AUTO_INCREMENT,
  `groupID` int NOT NULL,
  `IDInGroup` int NOT NULL,
  `name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Table2

CREATE TABLE `customersSecondary` (
  `groupID` int NOT NULL,
  `IDInGroup` int NOT NULL,
  `name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Both tables are practically identical, but customersSecondary is by design a staging table for customersPrimary. The big difference is the primary keys: customersPrimary has an auto-incrementing primary key, while customersSecondary has a composite primary key on (groupID, IDInGroup).

In both tables the combination of groupID and IDInGroup are unique.

Here is the query I want to optimize

UPDATE customersPrimary 
INNER JOIN customersSecondary ON 
  (customersPrimary.groupID = customersSecondary.groupID
  AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
 customersPrimary.name = customersSecondary.name, 
 customersPrimary.address = customersSecondary.address

This query works but scans EVERY row in customersSecondary.

Adding

WHERE customersPrimary.groupID = (groupID)

This cuts it down significantly, to the number of rows in customersSecondary with that groupID. But that is still often far larger than the number of rows actually being updated, since a single group can contain many rows. I think the WHERE clause needs improvement.

I can control table structure and add indexes. I will have to keep both tables.

Any suggestions would be helpful.

2 Answers


  1. Your existing query requires a full table scan because you are asking it to update every matching row on the left based on the values on the right. Presumably the optimiser is scanning customersSecondary because it has fewer rows, or at least it thinks it has.

    Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?

    You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.

    ALTER TABLE `customersPrimary`
        ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
        ADD INDEX `idx_customer_primary_updated` (`updatedAt`);
    
    ALTER TABLE `customersSecondary`
        ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);
    

    And then you can add updatedAt to your join criteria and the WHERE clause –

    UPDATE customersPrimary cp
    INNER JOIN customersSecondary cs
        ON cp.groupID = cs.groupID
    AND cp.IDInGroup = cs.IDInGroup
        AND cp.updatedAt < cs.updatedAt
    SET
        cp.name = cs.name, 
        cp.address = cs.address,
        cp.updatedAt = cs.updatedAt
    WHERE cs.updatedAt > :last_query_run_time;
    

    For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used a bit more than one hour to make sure records aren’t missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary:

    UPDATE customersPrimary cp
    CROSS JOIN (SELECT MAX(updatedAt) AS maxUpdatedAt FROM customersPrimary) t
    INNER JOIN customersSecondary cs
        ON cp.groupID = cs.groupID
        AND cp.IDInGroup = cs.IDInGroup
        AND cp.updatedAt < cs.updatedAt
    SET
        cp.name = cs.name,
        cp.address = cs.address,
        cp.updatedAt = cs.updatedAt
    WHERE cs.updatedAt > t.maxUpdatedAt;
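
    As a quick sanity check (not part of the original answer), MySQL can EXPLAIN an UPDATE, which shows whether the new index is chosen. The date literal below is a stand-in for :last_query_run_time:

    EXPLAIN
    UPDATE customersPrimary cp
    INNER JOIN customersSecondary cs
        ON cp.groupID = cs.groupID
        AND cp.IDInGroup = cs.IDInGroup
        AND cp.updatedAt < cs.updatedAt
    SET
        cp.name = cs.name,
        cp.address = cs.address,
        cp.updatedAt = cs.updatedAt
    WHERE cs.updatedAt > '2024-01-01 00:00:00';

    One would expect range access on idx_customer_secondary_updated in the cs row of the plan; if cs still shows a full scan, the cutoff is probably not selective enough to be worth the index.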
    
  2. Plan A:

    Something like this would first find the "new" rows, then add only those:

    UPDATE primary
        JOIN ( SELECT ...
                 FROM secondary
                 LEFT JOIN primary ON ...
                WHERE primary... IS NULL ) AS new_rows
          ON ...
        SET ...
    

    Might secondary have changes? If so, a variant of that would work.
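
    Filling in the tables from the question, a concrete sketch of Plan A might look like this (hypothetical and untested; it inserts only the staging rows that have no match in customersPrimary yet):

    INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
    SELECT cs.groupID, cs.IDInGroup, cs.name, cs.address
    FROM customersSecondary cs
    LEFT JOIN customersPrimary cp
        ON cp.groupID = cs.groupID
        AND cp.IDInGroup = cs.IDInGroup
    WHERE cp.id IS NULL;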

    Plan B:

    Better yet is to TRUNCATE TABLE secondary after it is folded into primary.
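
    One way Plan B could look, assuming an upsert is acceptable; INSERT ... ON DUPLICATE KEY UPDATE keys off the unique (groupID, IDInGroup) index on customersPrimary:

    INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
    SELECT groupID, IDInGroup, name, address
    FROM customersSecondary
    ON DUPLICATE KEY UPDATE
        name = VALUES(name),
        address = VALUES(address);
    
    TRUNCATE TABLE customersSecondary;

    Note that TRUNCATE causes an implicit commit, so the two statements cannot be rolled back together; if atomicity matters, DELETE FROM customersSecondary inside a transaction is the safer (slower) variant.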
