I am trying to update one table based on another in the most efficient way.
Here is the DDL of the tables I am trying to update
Table1
CREATE TABLE `customersPrimary` (
`id` int NOT NULL AUTO_INCREMENT,
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table2
CREATE TABLE `customersSecondary` (
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Both tables are practically identical, but customersSecondary is by design a staging table for the other. The big difference is the primary keys: table 1 has an auto-incrementing primary key, while table 2 has a composite primary key.
In both tables the combination of groupID and IDInGroup is unique.
Here is the query I want to optimize
UPDATE customersPrimary
INNER JOIN customersSecondary ON
(customersPrimary.groupID = customersSecondary.groupID
AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
customersPrimary.name = customersSecondary.name,
customersPrimary.address = customersSecondary.address
This query works but scans EVERY row in customersSecondary.
Adding
WHERE customersPrimary.groupID = (groupID)
cuts it down significantly, to the number of rows in customersSecondary with that groupID. But this is still often far larger than the number of rows actually being updated, since a group can be large. I think the WHERE clause needs improvement.
I can control table structure and add indexes. I will have to keep both tables.
Any suggestions would be helpful.
2 Answers
Your existing query requires a full table scan because you are saying "update everything on the left based on the value on the right". Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it has.
Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?
You could add separate indices on name and address, but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.
And then you can add updatedAt to your join criteria and the WHERE clause, along the lines of the sketch below.
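The exact statements aren't shown in the answer, so this is only a minimal sketch of the idea, assuming updatedAt is added to customersSecondary and :last_query_run_time is a placeholder supplied by the application (the column and index names are mine):

ALTER TABLE customersSecondary
  ADD COLUMN updatedAt TIMESTAMP NOT NULL
    DEFAULT CURRENT_TIMESTAMP
    ON UPDATE CURRENT_TIMESTAMP,
  ADD KEY idx_updatedAt (updatedAt);

-- Only touch rows changed since the last sync run
UPDATE customersPrimary
INNER JOIN customersSecondary ON
    (customersPrimary.groupID = customersSecondary.groupID
    AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
    customersPrimary.name = customersSecondary.name,
    customersPrimary.address = customersSecondary.address
WHERE customersSecondary.updatedAt >= :last_query_run_time;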
For :last_query_run_time you could use the last run time, if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary.
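As a rough illustration (my own sketch, not code from the comment), the time filter in the WHERE clause would then look something like:

WHERE customersSecondary.updatedAt >= NOW() - INTERVAL 65 MINUTE

or, using the MAX(updatedAt) suggestion (which assumes customersPrimary also carries an updatedAt column):

WHERE customersSecondary.updatedAt >= (SELECT MAX(updatedAt) FROM customersPrimary)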
Plan A:
Something like this would first find the "new" rows, then add only those:
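The original code block didn't survive here, so the following is only a sketch of that idea, assuming "new" means a (groupID, IDInGroup) pair not yet present in customersPrimary:

INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT s.groupID, s.IDInGroup, s.name, s.address
FROM customersSecondary AS s
LEFT JOIN customersPrimary AS p
    ON p.groupID = s.groupID
    AND p.IDInGroup = s.IDInGroup
WHERE p.id IS NULL;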
Might secondary have changes? If so, a variant of that would work.
Plan B:
Better yet is to TRUNCATE TABLE secondary after it is folded into primary.
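One possible shape for that cycle (my assumption; the answer doesn't spell the statements out) is to fold new and changed rows with ON DUPLICATE KEY UPDATE, relying on the unique key on (groupID, IDInGroup), and then empty the staging table:

-- Fold the staging table into the primary table
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT groupID, IDInGroup, name, address
FROM customersSecondary
ON DUPLICATE KEY UPDATE
    name = VALUES(name),
    address = VALUES(address);

-- Empty the staging table so the next run only sees fresh rows
TRUNCATE TABLE customersSecondary;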