I have a quick question regarding migrating a large sample data set from my local device to an Amazon Aurora RDS instance (no DMS approach).
So basically I am working on a proof of concept and I need to populate an Amazon Aurora DB with 2 million rows of data. I have generated a SQL file with 2 million INSERT statements. Now I need to get this SQL file into the RDS instance. What is the best (by best I mean fastest) option to do this? Can anyone suggest one?
2 Answers
Something to consider if your data was loaded into S3 at some point: you could skip a few steps and load directly from S3.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
Obviously, this only applies if it makes sense for your data pipeline.
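For Aurora MySQL, that looks roughly like the sketch below (the bucket, file, table, and column names are made up, and you'd first need to associate an IAM role that lets the cluster read from the bucket, as described in the doc above):

-- assumes s3://my-bucket/t1-data.csv is a plain CSV matching the target table
LOAD DATA FROM S3 's3://my-bucket/t1-data.csv'
INTO TABLE t1
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(x, s);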
The answer depends on a few different things, like which database engine (PostgreSQL or MySQL) and the server settings. Here are some general things to consider. All of these work by running mysql, psql, or whichever client program you use, with the option to run the statements in a file.
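For reference, that part is something like the following (the endpoint, user, and database names here are placeholders for your own):

# MySQL-compatible Aurora: feed the file to the mysql client
mysql -h mycluster.cluster-abc123.us-east-1.rds.amazonaws.com -u admin -p mydb < inserts.sql
# PostgreSQL-compatible Aurora: psql has a flag for running a file
psql -h mycluster.cluster-abc123.us-east-1.rds.amazonaws.com -U admin -d mydb -f inserts.sql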
Don't generate 2 million separate INSERT statements. Use multiple rows in the VALUES clause of each one, e.g.
postgres=> create table t1 (x int, s varchar);
postgres=> insert into t1 values (1, 'one'), (2, 'two'), (3, 'three');
Since you have control over generating the text of the INSERT statements, you might bundle 1000 rows into each one.
Also, don't do 2 million COMMITs, as would happen if you ran 2 million INSERT statements with autocommit turned on. Start a transaction, do N inserts, then commit; rinse and repeat. I'm not sure offhand what the ideal value of N is. Since you've already reduced the number of INSERT statements by bundling rows as above, maybe each transaction only has a few of these gigantic inserts in it.
I don't think you want to do the whole thing in one single transaction, though, just because of the possibility of overloading memory. As for the right balance of VALUES per INSERT and INSERTs per transaction, I don't have a recommendation at hand; it could also depend on how many columns are in each INSERT, how long the string values are, and so on.
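Putting those two points together, the generated file could look something like this sketch (the batch sizes and row values are only for illustration):

BEGIN;
INSERT INTO t1 (x, s) VALUES (1, 'one'), (2, 'two'), (3, 'three') /* ... up to ~1000 rows per statement ... */;
INSERT INTO t1 (x, s) VALUES (1001, 'a'), (1002, 'b'), (1003, 'c') /* ... another ~1000 rows ... */;
-- a handful more multi-row INSERTs here
COMMIT;
-- then start the next transaction and repeat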
You can start up multiple sessions and do these transactions & inserts in parallel. No reason to wait until row 1000 is finished inserting before starting on row 50,000 or row 750,000. That means you’ll split all these statements across multiple files. One of the strengths of Aurora is handling a lot of concurrent connections like this.
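One way to run those parallel sessions from a shell, assuming you've split the generated SQL into several files (the names and degree of parallelism are arbitrary, and the split points should land on statement/transaction boundaries):

# split on line boundaries into 8 pieces (GNU split), or just generate 8 files to begin with
split -n l/8 inserts.sql part_
# one client session per piece, all running concurrently
for f in part_*; do
  psql -h <cluster-endpoint> -U admin -d mydb -f "$f" &
done
wait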
Lastly, another Aurora-specific technique. (Well, it would work for RDS databases too.) Modify the DB instance to a higher-capacity instance class, do the data loading, then modify it back to the original instance class. Certain operations like data loading and engine upgrades benefit from having lots of cores and lots of memory, which can give you huge time savings. That can make it worth paying for a few minutes of an 8xlarge or whatever, even if afterwards your queries run fine on a much smaller instance class.
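With the AWS CLI that would be along these lines (the instance identifier and instance classes are placeholders; modifying the class restarts the instance, so plan for a short interruption):

# scale up before the load
aws rds modify-db-instance --db-instance-identifier my-aurora-instance --db-instance-class db.r6g.8xlarge --apply-immediately
# ...run the data load...
# scale back down afterwards
aws rds modify-db-instance --db-instance-identifier my-aurora-instance --db-instance-class db.r6g.large --apply-immediately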
If you don't mind rewriting the data into CSV form or something other than actual INSERT statements, check out the mysqlimport command for MySQL, or psql's \copy command for PostgreSQL. (\copy reads the data from your local machine and so works for Aurora, whereas the server-side COPY statement assumes the data is in a file on the server, which you don't have ssh or ftp access to with Aurora.)
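Rough examples of both, assuming the data is in a local CSV file (table, column, and file names are made up; note that mysqlimport names the target table after the file, so t1.csv loads into table t1):

# PostgreSQL: \copy runs in psql and streams the local file up to the server
psql -h <cluster-endpoint> -U admin -d mydb -c "\copy t1 (x, s) FROM 't1.csv' WITH (FORMAT csv)"
# MySQL: mysqlimport with --local reads the file from your machine
mysqlimport --local --fields-terminated-by=',' -h <cluster-endpoint> -u admin -p mydb t1.csv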