
I have an S3 bucket with some 10M objects which all need to be processed and sent to OpenSearch. To this end I am evaluating whether Kinesis can be used for this.

The solutions online seem to imply using Lambda, but since I have 10M objects, I would think the function will time out before the for loop is exhausted.

So, the setup I would like is:

S3 --> (some producer) --> Kinesis Data Streams --> OpenSearch (destination)

What would be the optimal way to go about this, please?

2 Answers


  1. Here’s an official blog post on the subject that suggests using the DMS service to send existing files in S3 to Kinesis.

    As for all the recommendations to use Lambda, those recommendations would be for the scenario where the files aren’t in S3 yet, and Lambda would be triggered each time a file is uploaded to S3, processing just that one file. Nobody is recommending you use Lambda to process 10M existing S3 files in a single function invocation.

    If you wanted to use Lambda for your current scenario, you could first create a list of all your S3 objects and write a one-time script that feeds that list of objects into an SQS queue. Then you could have a Lambda function that processes the messages in the queue over time, taking the object key from the queue, reading the file from S3, and sending it to Kinesis. Lambda could process those in batches of up to 10 at a time. You could then also have the S3 bucket configured to send new object notifications to the same SQS queue, and Lambda would automatically process any new objects that you add to the bucket later.
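    A rough sketch of that approach is below. The bucket name, queue URL, and Kinesis stream name are placeholders, and the backfill script sends plain object keys, so the handler parses messages that way (S3 event notifications arriving later would carry a JSON event body and need to be parsed differently):

    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    kinesis = boto3.client("kinesis")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-backfill-queue"  # placeholder

    def enqueue_existing_objects():
        """One-time backfill: list every object in the bucket and queue its key."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket="my-bucket"):
            for obj in page.get("Contents", []):
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])

    def handler(event, context):
        """Lambda handler triggered by the SQS queue (batches of up to 10 messages)."""
        records = []
        for message in event["Records"]:
            key = message["body"]  # plain object key sent by the backfill script
            body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
            records.append({"Data": body, "PartitionKey": key})
        if records:
            kinesis.put_records(StreamName="my-stream", Records=records)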

  2. Mark B’s answer is definitely a viable option, and I’d suggest configuring your SQS queue to trigger Lambda for each message.

    Unless you need Kinesis for some ETL functionality, it’s likely that you can go from S3 to OpenSearch directly.

    Assuming the docs in S3 are formatted suitably for OpenSearch, I would take one of the following approaches:

    1. AWS Step Functions has a built-in pattern to process items in S3. This would iterate over all the objects in a chosen bucket (or folder, etc.) that match your description. Each object could then be sent to a Lambda function to save its contents to OpenSearch (see the sketch after this list).
      • Assuming you have some ETL or formatting requirements, this would be easy to implement in Lambda.
      • I can’t find any documentation for the SFN S3 patterns, but they’re available in Workflow Studio; see this screenshot.
    2. If you’re comfortable with Python, the AWS SDK for Pandas (previously AWS Data Wrangler) is a super easy option. I’ve used it extensively for moving data from CSVs, S3, and other locations into OpenSearch.
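    For option 1, the per-object Lambda might look something like this sketch; the event shape, domain endpoint, bucket, and index name are all assumptions to adjust to your state machine and cluster, and authentication is omitted:

    import json
    import boto3
    from opensearchpy import OpenSearch

    s3 = boto3.client("s3")
    # placeholder endpoint; add auth (e.g. basic or SigV4) as appropriate for your domain
    client = OpenSearch(
        hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
        use_ssl=True,
    )

    def handler(event, context):
        # assumption: the map state passes each listed object's key as event["Key"]
        key = event["Key"]
        body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
        document = json.loads(body)  # assumes one JSON document per object
        client.index(index="my-index", body=document, id=key)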

    Using the AWS SDK for Pandas, you might achieve what you’re looking for like this…

    import awswrangler as wr
    from opensearchpy import OpenSearch

    # read all the JSON documents under the prefix into a single DataFrame
    items = wr.s3.read_json(path="s3://my-bucket/my-folder/")

    # connect + upload to OpenSearch ("my-index" is a placeholder index name)
    my_client = OpenSearch(...)
    wr.opensearch.index_df(client=my_client, df=items, index="my-index")
    

    The AWS SDK for Pandas can iterate over chunks of S3 items, and there’s a tutorial on indexing JSON (and other file types) from S3 to OpenSearch.
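    To avoid pulling all 10M documents into memory at once, one simple way to work in batches is to list the keys first and read them in slices; the batch size and index name below are just illustrative:

    keys = wr.s3.list_objects("s3://my-bucket/my-folder/")
    for i in range(0, len(keys), 100):
        # read a slice of objects into a DataFrame and index it
        batch = wr.s3.read_json(path=keys[i:i + 100])
        wr.opensearch.index_df(client=my_client, df=batch, index="my-index")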
