I need to process a large CSV from AWS S3 to Postgres DB (AWS RDS). I am new to AWS and need some suggestions on the architecture for this. Below are the requirements.
- We receive a CSV file daily in an S3 bucket. The file is about 200,000 records (~0.5 GB).
- We need to process the records, apply some transformations, and then load the data into a table in the Postgres Database (AWS RDS)
Some specific questions:
- I am looking for a solution that is repeatable and automated – not manual.
- Processing efficiency is a priority but does not need to be super fast (so the cost can be optimized).
- If the suggestion is to stream the file, how do we handle failures? If processing stops in the middle of the file, is there a way to restart from the failed point, or do we need to start from scratch and handle duplicates in the downstream process/DB?
- Once the file is processed, how do we rename it or move it, so it is not processed again? What are the best practices?
Thank you.
2 Answers
You can import data from Amazon S3 into an RDS for PostgreSQL DB instance using the aws_s3 extension (see "Import data from Amazon S3 into an RDS for PostgreSQL DB instance" in the Amazon Relational Database Service documentation). It uses the Postgres COPY command to load CSV-style data from S3. The hard part is triggering it and knowing which objects in S3 need to be processed. A sketch of the import call follows.
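For reference, the import itself boils down to a single SQL call. Here is a minimal sketch using psycopg2; the staging table name, column handling, and environment variable names are assumptions you would adapt:

```python
# Sketch only: runs aws_s3.table_import_from_s3 against RDS for PostgreSQL.
# Assumes the aws_s3 extension is installed and the instance has an IAM role
# allowing it to read the bucket. Table and env-var names are hypothetical.
import os
import psycopg2

IMPORT_SQL = """
SELECT aws_s3.table_import_from_s3(
    'staging_daily_csv',                   -- hypothetical target table
    '',                                    -- '' = load into the table's columns in order
    '(format csv, header true)',           -- COPY options for the CSV
    aws_commons.create_s3_uri(%s, %s, %s)  -- bucket, object key, region
);
"""

def import_csv(bucket: str, key: str, region: str = "us-east-1") -> None:
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(IMPORT_SQL, (bucket, key, region))
    conn.close()
```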
I had one situation where we dumped files each day into S3 with the same filenames, plus an additional file that contained the time the data was dumped. We used the pg_cron extension to check every half hour: a PL/pgSQL stored procedure would examine the 'timestamp' file, compare its date with the last load time, and, if it differed, load all the files (a sketch of registering such a job follows).
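For that pattern, the polling job is registered once inside the database. A rough sketch, assuming a hypothetical load_daily_files() procedure that does the timestamp comparison and calls the aws_s3 import:

```python
# Sketch: register a half-hourly pg_cron job. load_daily_files() is a hypothetical
# PL/pgSQL procedure that compares the 'timestamp' file's value with the last load
# time and re-imports the CSVs via aws_s3 when it has changed.
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],  # pg_cron jobs live in the database where the extension is installed
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
with conn, conn.cursor() as cur:
    # '*/30 * * * *' = every 30 minutes
    cur.execute("SELECT cron.schedule('*/30 * * * *', $$CALL load_daily_files()$$);")
conn.close()
```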
If your situation is that files are being added with different filenames, it is more difficult, because the stored procedure cannot list the objects in S3. Instead, you could use an Amazon S3 trigger to invoke an AWS Lambda function that handles the import.
Alternatively, it seems like you could also use AWS Glue to trigger an import process. I haven’t tried it, but here’s an article: Practical Way to Use AWS Glue with Postgresql – DEV Community
This seems like a pretty standard use of a Lambda, especially since it's once a day. You'd set up an S3 -> Lambda trigger: when the file is dropped into S3, the Lambda is invoked, reads the file from S3, and processes the records. You don't say what errors can happen but, at a guess, you may have a bad row here and there. That error can be logged, and the bad row could be sent to an SQS queue for later manual processing.
200k rows will take a bit of time. A Lambda can run for up to 15 minutes, so raise the timeout well above its low default (3 seconds).
At ~0.5 GB you could read the entire file into memory, provided you give the Lambda a generous enough memory setting. No streaming needed. A sketch of such a handler follows.
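For illustration only, here is a minimal sketch of such a handler. It assumes psycopg2 as the Postgres driver (which would need to be packaged with the function, e.g. as a layer); the table name, queue URL, environment variables, and transform() function are all hypothetical:

```python
# Sketch of the S3-triggered Lambda described above. All table, queue, and
# variable names are hypothetical; adapt the transform and INSERT to your schema.
import csv
import io
import json
import os
import urllib.parse

import boto3
import psycopg2

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def transform(row):
    # Placeholder for whatever transformations the pipeline needs.
    return {"id": int(row["id"]), "value": row["value"].strip()}

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded

    # ~0.5 GB fits in memory if the Lambda's memory setting is large enough.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    conn.autocommit = True  # a bad insert then doesn't poison the rest of the load
    cur = conn.cursor()

    for row in csv.DictReader(io.StringIO(body)):
        try:
            t = transform(row)
            # For 200k rows you would batch these (e.g. psycopg2.extras.execute_values).
            cur.execute("INSERT INTO daily_data (id, value) VALUES (%s, %s)", (t["id"], t["value"]))
        except Exception:
            # Park the bad row on a queue for later manual processing.
            sqs.send_message(QueueUrl=os.environ["BAD_ROW_QUEUE_URL"], MessageBody=json.dumps(row))

    cur.close()
    conn.close()
```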
The Lambda will only be triggered once, when the file is put into S3, so you don't have to rename or move it anywhere. If you'd prefer, however, you could move the file at the end of processing to something like a "processed" folder. But be careful: when you do this, your Lambda will be triggered again, and you don't want to reprocess the file. Your code should look at the incoming event and check whether the file key contains something like /processed, or some other indication that this is not something to worry about (a sketch of such a move and prefix check follows).
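S3 has no rename operation; a "move" is a copy plus a delete. A small sketch, with the bucket layout and prefix name as assumptions:

```python
# Sketch: "move" a processed object under a processed/ prefix (copy + delete),
# and skip events for objects that are already under that prefix.
import boto3

s3 = boto3.client("s3")
PROCESSED_PREFIX = "processed/"  # hypothetical prefix

def already_processed(key: str) -> bool:
    # Ignore events fired by our own move into the processed/ "folder".
    return key.startswith(PROCESSED_PREFIX)

def move_to_processed(bucket: str, key: str) -> None:
    new_key = PROCESSED_PREFIX + key
    s3.copy_object(Bucket=bucket, Key=new_key, CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)
```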
The only downside with the direct S3 -> Lambda setup is that, in the event of a catastrophic failure of your Lambda code (say an unhandled exception), the S3 event is sent only once and you have to handle it then; you won't get another event. So, for safety, change it up just a bit and do S3 -> SQS -> Lambda instead. Your Lambda is triggered when S3 puts an event into the SQS queue; you read the message to get the S3 event, and if your Lambda completes successfully the message is deleted from SQS. Watch out for the SQS message visibility timeout if your Lambda takes a while to run. If your Lambda crashed, then after the visibility timeout the message will "reappear" and trigger another run of the Lambda. There is a field on an SQS message (ApproximateReceiveCount) that tells you how many times the message has been received; pay attention to it so you don't create an infinite loop. A sketch of such an SQS-triggered handler follows.
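A minimal sketch of that SQS-triggered handler, assuming the queue subscribes to the bucket's notifications; the receive-count cap and the process_file() function are assumptions:

```python
# Sketch of the S3 -> SQS -> Lambda variant. Each SQS record body carries the
# original S3 event; ApproximateReceiveCount guards against an infinite retry loop.
import json

MAX_RECEIVES = 3  # hypothetical cap before we give up on a message

def handler(event, context):
    for record in event["Records"]:
        receive_count = int(record["attributes"]["ApproximateReceiveCount"])
        if receive_count > MAX_RECEIVES:
            # Log and skip rather than retry forever; a dead-letter queue is the usual fix.
            print(f"Giving up after {receive_count} receives: {record['messageId']}")
            continue

        s3_event = json.loads(record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            if key.startswith("processed/"):  # ignore our own post-processing move
                continue
            process_file(bucket, key)  # hypothetical: the CSV-to-Postgres logic sketched above

def process_file(bucket, key):
    ...
```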