
I have intermediate AWS knowledge and an issue that I can see multiple ways of solving, and I'm looking for opinions from more experienced AWS architects.

I have an on-premise system that produces ~30k XML files (each <100KB) throughout the day. These XML files have to be sent to AWS to be parsed.

Possible solutions:

  1. Feed the XML files into a Kinesis Data Firehose stream (presumably via an API Gateway) that parses each file in a Lambda and also stores the raw files in S3. This is, in my opinion, the ideal solution (?). A new XML file is created approximately every 3 seconds.
  2. Upload each XML file via a presigned S3 URL and trigger a Lambda that parses it (rough sketch below). This involves fetching a presigned URL for every file from an API Gateway. I am unsure whether this is a good approach for files produced at the frequency mentioned above.
  3. Same as above, but use SFTP for the upload.

Out of these 3 solutions, I think option 1 is most suitable, but I’m eager to hear opinions on this.
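For reference, here is a rough sketch of how I picture option 2's upload flow. The bucket name, key layout, and presign endpoint are placeholders, not an existing setup:

```python
# Sketch of option 2: a small Lambda behind API Gateway hands out presigned PUT URLs,
# and the on-prem client uploads each XML straight to S3 with them.
# Bucket name, key layout, and endpoint are placeholders.
import boto3
import requests  # used by the on-prem client

s3 = boto3.client("s3")
BUCKET = "my-xml-ingest-bucket"  # placeholder

def presign_handler(event, context):
    """Lambda behind API Gateway: returns a presigned PUT URL for one object key."""
    key = event["queryStringParameters"]["key"]  # e.g. "incoming/file-0001.xml"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,
    )
    return {"statusCode": 200, "body": url}

def upload_from_on_prem(file_path, key, presign_endpoint):
    """On-prem client: fetch a presigned URL for this file, then PUT it straight to S3."""
    url = requests.get(presign_endpoint, params={"key": key}).text
    with open(file_path, "rb") as f:
        requests.put(url, data=f)
```

At ~30k files per day that is roughly 60k HTTPS requests (one presign call plus one PUT per file), which is exactly what makes me unsure about this option.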

There is also a scenario where the XML files are collected once a day into a batch of ~30k files. For that case, I have the following questions:

  • Does option 1 still make sense, even though the Firehose is fed only once a day with a large number of files at once?
  • Does option 2/3 still make sense? Another possibility would be to upload a single zip and unzip it with a Lambda into another prefix, where again a Lambda is triggered for every new file (see the sketch after this list). That would mean 30k Lambda invocations being triggered at once.
  • There are S3 Batch Operations, which apply a Lambda to every file in a predetermined list of files. How does that differ from the zip version of option 2? It also seems that S3 Batch Operations jobs cannot be provisioned with Terraform, which is a disadvantage for me.
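The "unzip" Lambda mentioned above would look roughly like this (bucket and prefix names are placeholders):

```python
# Rough sketch of the unzip Lambda: triggered when the daily zip lands in S3,
# it extracts each XML into another prefix, where the per-file parsing Lambda
# is then triggered by the resulting PUT events. Names are placeholders.
import io
import zipfile
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    zip_key = record["object"]["key"]  # e.g. "uploads/daily-batch.zip"

    body = s3.get_object(Bucket=bucket, Key=zip_key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(body)) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue
            s3.put_object(
                Bucket=bucket,
                Key=f"extracted/{name}",  # prefix the parsing Lambda listens on
                Body=zf.read(name),
            )
```

Note that this holds the whole archive in memory, so the Lambda would need enough memory (or a streaming approach) for a ~30k-file zip.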

I understand that this question is not super specific, but I would appreciate help.

One concrete question I have: does triggering 30k Lambdas "at once" pose an issue? The tasks are not time-sensitive, so it's not a problem if "only" 1k Lambdas run in parallel, as long as all of them eventually run.
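If it helps, I could also cap the parsing Lambda's concurrency explicitly so a burst of S3 events is worked off gradually; something like this (function name and limit are placeholders):

```python
# Sketch: reserve a fixed concurrency for the parsing Lambda so 30k S3 events
# don't all run at once. Function name and limit are placeholders.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.put_function_concurrency(
    FunctionName="xml-parser",
    ReservedConcurrentExecutions=100,  # at most 100 invocations in parallel
)
```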

3 Answers


  1. It depends on how you want to process the files, how long each file takes to process, and whether you want real-time or batch-style processing.

    Firehose + Lambda:

    Although the Firehose + Lambda option is more convenient to set up, invoking a Lambda that frequently (every ~3 seconds) adds up in cost, and Kinesis Firehose likewise charges based on the amount of data processed. So if you are up for it, I would say do a cost analysis for your case.

    Alternative approach:

    If your files are created every 3 seconds and you need real-time processing, a long-running microservice that reads each file from S3, parses it, and does the further processing is recommended.

    If you are okay with processing the files once per day at a convenient time, an Amazon ECS scheduled task with a batch framework like Spring Batch is a good fit. The scheduled task runs at the specified time, reads your files from S3 and processes them, and the job shuts down afterwards, so it won't incur cost while idle (rough sketch below).

    Ref: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html
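    As a rough illustration (shown in Python rather than Spring Batch for brevity; bucket and prefix names are placeholders), the core of such a scheduled job is just: list the day's objects, parse each one, then exit:

    ```python
    # Sketch of the scheduled job's core loop: list the day's XML files from S3,
    # parse each one, then exit. Bucket and prefix names are placeholders.
    import boto3
    import xml.etree.ElementTree as ET

    s3 = boto3.client("s3")
    BUCKET = "my-xml-ingest-bucket"   # placeholder
    PREFIX = "incoming/2024-01-01/"   # placeholder: the day's batch

    def run():
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                root = ET.fromstring(body)
                # ... transform `root` and write the result wherever it needs to go ...

    if __name__ == "__main__":
        run()
    ```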

  2. As you mentioned "batch", you should explore AWS Batch: https://aws.amazon.com/batch/

    If you have not heard about AWS Batch, here is the description:

    What is AWS Batch?

    AWS Batch is a set of batch management capabilities that enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized compute resources) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters, allowing you to instead focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads using Amazon ECS, Amazon EKS, and AWS Fargate with an option to utilize spot instances.


    To utilize AWS Batch, you create a container that contains whatever code and business logic you want it to execute, and you run it against a target storage, like S3.

    The good thing about AWS Batch is that you can use Spot Instances, which saves money, and if an instance is interrupted before the task completes, AWS Batch will schedule a replacement instance to redo the task.

    It does not matter whether you want to process 1 file or 1 million files in your S3 bucket.
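    Once the job definition and job queue exist, kicking off a run is a single API call; here is a sketch with placeholder names:

    ```python
    # Sketch: submit an AWS Batch job that processes the files in S3.
    # Job name, queue, definition and bucket/prefix are placeholders.
    import boto3

    batch = boto3.client("batch")
    batch.submit_job(
        jobName="parse-daily-xml",
        jobQueue="xml-processing-queue",
        jobDefinition="xml-parser-job:1",
        containerOverrides={
            "environment": [
                {"name": "S3_BUCKET", "value": "my-xml-ingest-bucket"},
                {"name": "S3_PREFIX", "value": "incoming/2024-01-01/"},
            ]
        },
    )
    ```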

    And to transfer your local files to the S3 bucket, you can use AWS Storage Gateway to sync/upload them to S3 and then trigger the AWS Batch process: https://aws.amazon.com/storagegateway/

  3. The cheapest solution so far would be using the distributed map feature of AWS Step Functions.

    Regarding the file upload, you need to decide how soon your data must be accessible after processing; whether you upload everything in one batch or as the files occur depends on that.

    Independent of your upload interval, I would use the S3 event that a new file has arrived, batch the files, and process them with Step Functions. To reduce costs, run the per-item processing as Express Workflows.
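    As a rough illustration, assuming a state machine whose Distributed Map state lists objects straight from the bucket/prefix it is given, the daily run is a single execution (ARN and names are placeholders):

    ```python
    # Sketch: start one Step Functions execution; the Distributed Map's item reader
    # is assumed to enumerate the S3 prefix itself. ARN and names are placeholders.
    import json
    import boto3

    sfn = boto3.client("stepfunctions")
    sfn.start_execution(
        stateMachineArn="arn:aws:states:eu-central-1:123456789012:stateMachine:xml-batch",
        input=json.dumps({"bucket": "my-xml-ingest-bucket", "prefix": "incoming/2024-01-01/"}),
    )
    ```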

    There are a couple of ways to integrate file transfer from on premises to the cloud. You can write a script that puts files into S3 directly, upload them via API Gateway, or use more sophisticated services like AWS DataSync or AWS Storage Gateway. Depending on how stable your connection is, you could also mount S3 directly into your filesystem, as described here.
