How can I use AWS tools to process a large list of email addresses in a Python script more quickly? I have a script that iterates through the list and passes each email address to a function, but even with multi-threading, it’s not processing quickly enough to meet my deadline.
I’m wondering if it’s possible to use AWS Step Functions and/or containerization to pass each email address to its own node/instance. Can someone walk me through the steps to achieve this?
2 Answers
One option would be:
1. Send each email address as a separate message to an Amazon SQS queue.
2. Configure an AWS Lambda function containing your processing logic to be triggered by the queue.
For testing purposes, just send one message to the queue at a time until you have the code working.
Lambda scales out automatically with the queue depth, up to a default limit of 1,000 concurrent executions per account, so once you send many messages to the queue they will be processed in parallel. A minimal sketch of both sides is below.
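Here's roughly what that looks like with boto3, assuming your existing per-address function is called process_email; the queue URL, region, and account ID are placeholders for your own values:

```python
import boto3

# Hypothetical queue URL - substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/email-jobs"

sqs = boto3.client("sqs")

def enqueue(emails):
    """Fan the email list out to SQS, 10 messages per request
    (send_message_batch accepts at most 10 entries per call)."""
    for start in range(0, len(emails), 10):
        batch = emails[start:start + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(start + i), "MessageBody": email}
                for i, email in enumerate(batch)
            ],
        )

def handler(event, context):
    """Lambda consumer, invoked by the SQS event source mapping.
    Each invocation receives up to the configured batch size of messages."""
    for record in event["Records"]:
        process_email(record["body"])  # your existing per-address function
```

With the event source mapping in place, Lambda polls the queue for you and spins up more concurrent invocations as the backlog grows, so the parallelism comes for free.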
I think Step Functions Distributed Map would be a great fit here. You can use it to iterate through your list and run a child workflow execution for each address (or batch them up if you find that helpful); a sketch of a state machine definition is below.
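To make that concrete, here's a minimal sketch of a Distributed Map state machine created with boto3. It assumes your list has been uploaded as a CSV to S3 and that a worker Lambda exists; the bucket, key, function name, and role ARN are all placeholders:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "ProcessEmails",
    "States": {
        "ProcessEmails": {
            "Type": "Map",
            # Read the email list straight from a CSV object in S3,
            # so the list never has to fit in the execution input.
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {
                    "InputType": "CSV",
                    "CSVHeaderLocation": "FIRST_ROW",
                },
                "Parameters": {"Bucket": "my-email-bucket", "Key": "emails.csv"},
            },
            # Hand each child execution a batch of rows rather than one.
            "ItemBatcher": {"MaxItemsPerBatch": 100},
            "MaxConcurrency": 1000,
            "ItemProcessor": {
                "ProcessorConfig": {
                    "Mode": "DISTRIBUTED",
                    "ExecutionType": "EXPRESS",
                },
                "StartAt": "ProcessBatch",
                "States": {
                    "ProcessBatch": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {
                            "FunctionName": "process-emails",
                            "Payload.$": "$",
                        },
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="process-email-list",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmailRole",
)
```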
As for how you process each item, you have options there too. If each processing step for an email address takes under 15 minutes (the Lambda timeout limit), you can use Lambda and compose the steps with Step Functions; a sketch of a batched worker handler is below. If a step will take longer, you might want to look at the Optimized Integration with ECS to run your processing as ECS tasks. Or, if you just want to run these tasks on EC2 instances, you can use Activities for that. And you can mix and match as you need to.
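For the Lambda path with the Distributed Map definition sketched above, the worker receives the batch as its payload. With an ItemBatcher configured and a CSV header row, each item arrives as a dict keyed by column name; "email" as the column name and process_email are assumptions to adapt to your own code:

```python
def handler(event, context):
    """Worker Lambda for the Distributed Map processor. With
    ItemBatcher configured, the payload carries an "Items" array
    of rows from the source CSV."""
    for item in event["Items"]:
        process_email(item["email"])  # your existing per-address function
    return {"processed": len(event["Items"])}
```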
To get started, I'd encourage you to check out the Distributed Map module in the Step Functions Workshop. And if you want a broader overview, a few of us gave a presentation on Distributed Map at re:Invent last year, which you can find on YouTube.