I have 1 million records in AWS DynamoDB that I want to process every day. Basically:
- Read the 1 million records.
- Apply some business logic to those records.
- Generate a request file containing the million records and write it to an AWS S3 bucket.
- Ensure the file stays under a configured size limit or record count (say 1 million); if it goes beyond either limit, split the output into multiple files so each stays within the limit.
- Send this request file to another application, which processes it and sends back a response file.
- Read and process the response file, then update the 1 million DynamoDB records.
The processing application will run in AWS. I am looking for ideas on how best to design this application using AWS services, keeping the failure points in mind.
2 Answers
You don’t provide any requirements or details on what the 1M items are.
You can use a single Lambda, or several in parallel, to do a segmented Scan, read the 1M items, and store the results on S3.
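A minimal sketch of what one segment worker might look like, assuming one Lambda invocation per scan segment (the table name, bucket, segment count, and event shape below are placeholders, not from your question):

```python
# Hypothetical segment worker: scans one segment of the table and writes it to S3.
import json
import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "my-records-table"   # assumption
BUCKET = "my-request-bucket"      # assumption
TOTAL_SEGMENTS = 8                # e.g. one Lambda invocation per segment

def handler(event, context):
    segment = event["segment"]    # 0..TOTAL_SEGMENTS-1, passed in by the orchestrator
    items = []
    kwargs = {
        "TableName": TABLE_NAME,
        "Segment": segment,
        "TotalSegments": TOTAL_SEGMENTS,
    }
    while True:
        resp = dynamodb.scan(**kwargs)
        items.extend(resp["Items"])
        last_key = resp.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key

    # One object per segment; a later step can merge or re-split as needed.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"requests/segment-{segment}.json",
        Body=json.dumps(items, default=str),
    )
```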
You can also use the export-to-S3 feature to export the entire table to S3 each day if you like.
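If you go the export route, kicking it off with boto3 could look roughly like this (the table ARN and bucket are placeholders; note that the table needs point-in-time recovery enabled for exports to work):

```python
# Hypothetical sketch of triggering a DynamoDB table export to S3.
import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-records-table",  # assumption
    S3Bucket="my-export-bucket",    # assumption
    ExportFormat="DYNAMODB_JSON",   # or "ION"
)
print(response["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```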
There are several ways to achieve this. The choice depends on factors such as processing duration, cost of operation, ease of use, error handling, and so on.
As an example, you could have a Lambda or a small service that reads from DynamoDB with paginated queries, does the processing, and publishes to S3. If you want to publish to S3 as a single file, keep in mind that the memory/CPU requirements can grow large and a Lambda might not be enough; AWS Batch can help parallelize the work.
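If a single reader does work for you, a rough sketch of paginated reads with a record-count split (which also covers the file-splitting requirement in your question) might look like this; the table name, bucket, threshold, and business-logic hook are assumptions:

```python
# Hypothetical sketch: paginated Scan, apply business logic, and roll over to a
# new S3 object once a configured record limit is reached.
import json
import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "my-records-table"     # assumption
BUCKET = "my-request-bucket"        # assumption
MAX_RECORDS_PER_FILE = 250_000      # configured split threshold

def apply_business_logic(item):
    # Placeholder for the actual transformation.
    return item

def flush(batch, part):
    s3.put_object(
        Bucket=BUCKET,
        Key=f"requests/request-part-{part:04d}.json",
        Body=json.dumps(batch, default=str),
    )

def run():
    paginator = dynamodb.get_paginator("scan")
    batch, part = [], 0
    for page in paginator.paginate(TableName=TABLE_NAME):
        for item in page["Items"]:
            batch.append(apply_business_logic(item))
            if len(batch) >= MAX_RECORDS_PER_FILE:
                flush(batch, part)
                batch, part = [], part + 1
    if batch:
        flush(batch, part)
```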
You could also explore publishing to Kinesis and then accumulating into S3. It is hard to say without knowing more about the system and business capabilities.
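One way to do that accumulation, assuming Kinesis Data Firehose (which buffers records and delivers batches to S3 for you), could be roughly; the stream name is a placeholder:

```python
# Hypothetical sketch: push processed records to a Firehose delivery stream
# that is configured to buffer and write batches to S3.
import json
import boto3

firehose = boto3.client("firehose")

def publish(record):
    firehose.put_record(
        DeliveryStreamName="requests-to-s3",   # assumption
        Record={"Data": (json.dumps(record, default=str) + "\n").encode("utf-8")},
    )
```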
You don’t have to send the S3 file to the other application yourself; the handoff can be integrated through SNS events (if the application can listen to SNS events). Again, hard to say without knowing more about the system details.
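As a sketch, that handoff could be wired up by having S3 notify an SNS topic whenever a request file lands; the bucket, topic ARN, and prefix below are placeholders, and the topic policy must also allow S3 to publish:

```python
# Hypothetical sketch: S3 event notification to an SNS topic for new request files.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-request-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:request-file-ready",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "requests/"}]}
                },
            }
        ]
    },
)
```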
Based on the details you have provided, this sounds like a batch system that processes a large amount of data periodically. So I would also suggest looking at EventBridge for scheduled events, exporting the DynamoDB table to S3, and processing the S3 file with AWS Batch/EMR/Glue, etc.
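A sketch of the scheduling piece with EventBridge, assuming a Lambda kicks off the daily run (rule name, schedule, and ARNs are placeholders; the Lambda also needs a resource-based permission allowing EventBridge to invoke it):

```python
# Hypothetical sketch: a daily EventBridge rule that triggers the export/processing job.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-record-export",
    ScheduleExpression="cron(0 2 * * ? *)",   # 02:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="daily-record-export",
    Targets=[
        {
            "Id": "start-export-lambda",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:start-export",
        }
    ],
)
```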