I have several queries along the lines of:
INSERT INTO my_data_archive.my_table (id, FieldName, Value, Position, PageId)
SELECT id, FieldName, Value, Position, PageId FROM my_data.my_table
WHERE PageId IN
(SELECT id FROM my_data.page WHERE PageTypeId BETWEEN 33 AND 36 AND Deleted IS NULL)
I also have some queries like
INSERT INTO my_data_archive.my_table (id, FieldName, Value, Position, PageId)
SELECT id, FieldName, Value, Position, PageId from my_data.my_table
WHERE PageId IN (SELECT id FROM my_data.page WHERE CreatedTimestamp <= DATE_SUB(CURRENT_DATE(),INTERVAL 3 YEAR) AND Deleted IS NULL)
Basically these are bulk inserts designed to archive off old data into a new schema. The query takes all the data (in my_table) that belongs to ‘pages’ (in the page table) of a certain type or older than 3 years and inserts it into the archive table.
The performance of the inserts is very slow due to the amount of data and probably a pretty suboptimal query. What would be the best way to improve the performance of this insert? Or is there a better approach?
2
Answers
Thanks. The inserts are by far the longest part of the process.
In my case MySQL seems to do a reasonable job of optimising the IN query. The JOIN didn't really improve the SELECT performance at all. I suspect in many cases that would be the answer though. What does help is outputting the records into a file with OUTFILE and inserting them via LOAD DATA INFILE.
At least by testing a chunk of the table's data by limiting the records output to the file I could calculate how long doing the whole lot would take. In the end for the entire 20 million rows it took about an hour to insert which is good enough for what I'm doing. I can't see any way to further improve on this much but open to more suggestions.
Avoid the construct
IN ( SELECT ... )
. Change to aJOIN
. For example:And add these composite indexes to
page
:The
INSERT
part of your query will take some time. Hopefully the above suggestions will help enough.