skip to Main Content

I have several queries along the lines of:

INSERT INTO my_data_archive.my_table (id, FieldName, Value, Position, PageId) 
  SELECT id, FieldName, Value, Position, PageId FROM my_data.my_table 
  WHERE PageId IN 
    (SELECT id FROM my_data.page WHERE PageTypeId BETWEEN 33 AND 36 AND Deleted IS NULL)

I also have some queries like

INSERT INTO my_data_archive.my_table (id, FieldName, Value, Position, PageId) 
  SELECT id, FieldName, Value, Position, PageId from my_data.my_table 
  WHERE PageId IN (SELECT id FROM my_data.page WHERE CreatedTimestamp <= DATE_SUB(CURRENT_DATE(),INTERVAL 3 YEAR) AND Deleted IS NULL)

Basically these are bulk inserts designed to archive off old data into a new schema. The query takes all the data (in my_table) that belongs to ‘pages’ (in the page table) of a certain type or older than 3 years and inserts it into the archive table.

The performance of the inserts is very slow due to the amount of data and probably a pretty suboptimal query. What would be the best way to improve the performance of this insert? Or is there a better approach?

2

Answers


  1. Chosen as BEST ANSWER

    Thanks. The inserts are by far the longest part of the process.

    In my case MySQL seems to do a reasonable job of optimising the IN query. The JOIN didn't really improve the SELECT performance at all. I suspect in many cases that would be the answer though. What does help is outputting the records into a file with OUTFILE and inserting them via LOAD DATA INFILE.

    At least by testing a chunk of the table's data by limiting the records output to the file I could calculate how long doing the whole lot would take. In the end for the entire 20 million rows it took about an hour to insert which is good enough for what I'm doing. I can't see any way to further improve on this much but open to more suggestions.


  2. Avoid the construct IN ( SELECT ... ). Change to a JOIN. For example:

    SELECT  t.id, t.FieldName, t.Value, t.Position, t.PageId
        FROM  my_data.my_table AS t
        JOIN  my_data.page AS p  ON p.id = t.PageId
        WHERE  p.PageTypeId BETWEEN 33 AND 36
          AND  p.Deleted IS NULL ) 
    

    And add these composite indexes to page:

    INDEX(Deleted, CreatedTimestamp)
    INDEX(Deleted, PageTypeId)
    

    The INSERT part of your query will take some time. Hopefully the above suggestions will help enough.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search