I am new to AWS and am about to host a website through AWS amplify. I have some data in a csv file (about 2 million rows). I get a new csv every month with updated data. The new csv has updates to previous rows and new additional rows. There is no unique identifier for each row in the csv file as any column could be adjusted in the new updated csv. I want to store the data in a dynamo db table but I am really not sure how to go about updating the data.
A couple options I have considered:
- deleting the previous dynamo db table and creating a new one with the new csv file (not sure about the logistics for this one)
- don’t use dynamo db and just read from the csv stored in s3 (not sure about the performance implications of reading from a csv each time I need to access the data)
- compare the new csv with old data in dynamo db, remove entries in dynamo db that do not occur in new csv, add all entries in new csv that do not occur in dynamo db (seems like way too many comparisons given the size of my dataset)
- just add new rows from the csv to dynamo db and ignore changes to previous rows (last resort)
I am open to any suggestions!
I have not tried any of the considered options yet.
2
Answers
Since you mentioned there is no record identifier, I assume there is no
update
operation per se, onlydelete
andinsert
.If the
deleted
records are no longer there in the new CSV, and if there is no information to figure out which rows are new (like acreate_timestamp
), option 3 is quite an expensive operation. You’d have to go through all the existing rows in your DB to figure out which ones needdelete
. You then have to go through all the new rows in your CSV to figure out which ones needinsert
.In that case, option 1 should be the most efficient option.
Alternatively, have the process that generates the CSV send you only those records to be deleted and inserted.
I’m assuming you already have a way to import the data to DynamoDB. Also, you definitely get new csv file in a defined time period. With this assumption, I would say create a TTL value for the dynamoDB records so the data will expire as defined by TTL. That way, you don’t need to worry about older data. (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html)
Also, do a cost analysis for importing the data, since there are lot of inserts.