I have an S3 bucket with about 10M objects, all of which need to be processed and sent to OpenSearch. To this end, I am evaluating whether Kinesis can be used.
The solutions online seem to imply using Lambda, but since I have 10M objects, I would expect the function to time out before the for loop finishes.
So, the setup I would like is:
S3 --> (some producer) --> Kinesis Data Streams --> OpenSearch (destination)
What would be the optimal way to go about this, please?
2 Answers
Here’s an official blog post on the subject that suggests using the DMS service to send existing files in S3 to Kinesis.
As for all the recommendations to use Lambda, those recommendations would be for the scenario where the files aren’t in S3 yet, and Lambda would be triggered each time a file is uploaded to S3, processing just that one file. Nobody is recommending you use Lambda to process 10M existing S3 files in a single function invocation.
If you wanted to use Lambda for your current scenario, you could first create a list of all your S3 objects, and write a one-time script that feeds that list of objects into an SQS queue. Then you could have a Lambda function that processes the messages in the queue over time, taking the object key from the queue, reading the file from S3, and sending it to Kinesis. Lambda can process those in batches (10 messages per invocation by default). Then you could have the S3 bucket configured to send new object notifications to the same SQS queue, and Lambda would automatically process any new objects that you add to the bucket later.
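As a rough sketch of the one-time script, something like the following could list the bucket page by page and push the keys into SQS in groups of 10 (the `SendMessageBatch` limit). The function names, the `{"bucket": ..., "key": ...}` message body, the bucket name, and the queue URL are all made up for illustration; only the boto3 calls (`get_paginator("list_objects_v2")`, `send_message_batch`) are real.

```python
import json
from itertools import islice

SQS_BATCH_LIMIT = 10  # SendMessageBatch accepts at most 10 entries per call


def iter_keys(s3, bucket):
    """Yield every object key in the bucket, one listing page at a time."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def batched(iterable, size):
    """Yield lists of at most `size` items without holding all keys in memory."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def to_entries(bucket, keys):
    """Build SendMessageBatch entries; the body layout is an assumed convention."""
    return [
        {"Id": str(i), "MessageBody": json.dumps({"bucket": bucket, "key": key})}
        for i, key in enumerate(keys)
    ]


def enqueue_all(s3, sqs, bucket, queue_url):
    """Send one SQS message per object key and return how many were enqueued."""
    sent = 0
    for chunk in batched(iter_keys(s3, bucket), SQS_BATCH_LIMIT):
        response = sqs.send_message_batch(
            QueueUrl=queue_url, Entries=to_entries(bucket, chunk)
        )
        if response.get("Failed"):
            raise RuntimeError(f"SQS rejected entries: {response['Failed']}")
        sent += len(chunk)
    return sent


def main():
    import boto3  # imported here so the helpers above can be tried without AWS

    total = enqueue_all(
        boto3.client("s3"),
        boto3.client("sqs"),
        bucket="my-bucket",  # hypothetical bucket name
        queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # hypothetical
    )
    print(f"enqueued {total} keys")
```

Calling `main()` from a machine with AWS credentials would run the whole thing. With 10M objects that is about a million `send_message_batch` calls, so you would probably want to run it somewhere without a short timeout (not in Lambda itself).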
Mark B's answer is definitely a viable option, and I’d suggest configuring your SQS queue to trigger Lambda for each message.
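For what it's worth, a handler for that SQS trigger might look roughly like the sketch below. It assumes each message body is either a small JSON document of the form `{"bucket": ..., "key": ...}` (an invented layout for the backfill messages) or a standard S3 event notification, and that the stream is called `my-stream` (hypothetical). The optional `s3` and `kinesis` parameters exist only so the function can be exercised locally with stand-ins.

```python
import hashlib
import json
from urllib.parse import unquote_plus

STREAM_NAME = "my-stream"  # hypothetical Kinesis data stream name


def extract_objects(event):
    """Yield (bucket, key) pairs from an SQS-triggered Lambda event."""
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        if "Records" in body:
            # S3 event notification: object keys arrive URL-encoded.
            for s3_record in body["Records"]:
                yield (
                    s3_record["s3"]["bucket"]["name"],
                    unquote_plus(s3_record["s3"]["object"]["key"]),
                )
        elif "key" in body:
            # Backfill message in the assumed {"bucket", "key"} layout.
            yield body["bucket"], body["key"]
        # Anything else (e.g. the s3:TestEvent message) is skipped.


def handler(event, context, s3=None, kinesis=None):
    """Read each referenced S3 object and forward its bytes to Kinesis."""
    if s3 is None or kinesis is None:
        import boto3  # available by default in the Lambda Python runtime

        s3 = s3 or boto3.client("s3")
        kinesis = kinesis or boto3.client("kinesis")

    forwarded = 0
    for bucket, key in extract_objects(event):
        data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Partition keys are capped at 256 characters, so hash the S3 key.
        partition_key = hashlib.sha256(key.encode("utf-8")).hexdigest()
        # Note: a single Kinesis record can carry at most 1 MiB of data.
        kinesis.put_record(
            StreamName=STREAM_NAME, Data=data, PartitionKey=partition_key
        )
        forwarded += 1
    return {"forwarded": forwarded}
```

If the handler raises, the messages in that batch return to the queue and are retried, so a dead-letter queue on the SQS side is probably worth configuring.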
Unless you need Kinesis for some ETL functionality, it’s likely that you can go from S3 to OpenSearch directly.
Assuming the docs in S3 are formatted suitably for OpenSearch, I would take one of the following approaches:
Using the AWS SDK for Pandas, you might achieve what you’re looking for like this…
The AWS SDK for Pandas can iterate over chunks of S3 items, and there’s a tutorial on indexing JSON (and other file types) from S3 to OpenSearch.
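A minimal sketch of the direct S3-to-OpenSearch route, assuming every object is a JSON file that `wr.opensearch.index_json` can read as-is. The function name `index_bucket`, the S3 path, the index name, and the domain endpoint are placeholders, and the `wr` parameter is only there so the loop can be dry-run with a stand-in module.

```python
def index_bucket(bucket_path, index, host, wr=None):
    """Index every JSON object under bucket_path into an OpenSearch index.

    Returns the number of files handed to OpenSearch.
    """
    if wr is None:
        # Typically installed with: pip install "awswrangler[opensearch]"
        import awswrangler as wr

    client = wr.opensearch.connect(host=host)
    files_indexed = 0
    # chunked=True makes list_objects return an iterator of key lists,
    # so the 10M paths are never all held in memory at once.
    for paths in wr.s3.list_objects(path=bucket_path, chunked=True):
        for path in paths:
            wr.opensearch.index_json(client=client, path=path, index=index)
            files_indexed += 1
    return files_indexed


def main():
    total = index_bucket(
        bucket_path="s3://my-bucket/docs/",  # hypothetical S3 prefix
        index="my-index",  # hypothetical index name
        host="my-domain.us-east-1.es.amazonaws.com",  # hypothetical endpoint
    )
    print(f"indexed {total} files")
```

This runs sequentially, so for 10M objects you would likely want to split the work by key prefix across several workers.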