
I was recently faced with an architectural problem. I developed a Node.js application that fetches three zip files from Census.gov (13 MB, 1.2 MB and 6.7 GB), which takes about 15 to 20 minutes. After the files are downloaded, the application unzips them and extracts the needed data into an AWS RDS database. The zip files are deleted once processing is done. The issue is that this application needs to run only once a year. What would be the best solution for this kind of task?

4 Answers


  1. I would use a cron job. You can use this website (https://crontab.guru/every-year) to determine the correct settings for the crontab.

    0 0 1 12 *

    This setting will run "At 00:00 on day-of-month 1 in December." Leave the day-of-week field as *: if both the day-of-month and day-of-week fields are restricted (as in 0 0 1 12 1), cron runs the job when either one matches, so it would also fire on every Monday in December.

    To run the Node.js program, put the command after the schedule, as in the line below. You may need to replace node with the full path to the node binary, and yourprogram.js with the full path to your script.

    0 0 1 12 * node yourprogram.js
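
To see how the day-of-month and day-of-week fields interact, here is a minimal sketch of the matching rule used by classic cron implementations. It is only an illustration: `cronMatches` and `fieldMatches` are made-up helper names, and the sketch handles plain numbers and `*` only (no ranges, lists or steps).

```javascript
// Sketch of how classic cron decides whether a job is due at a given minute.
// Assumption: only plain numbers and "*" appear in the expression.
function fieldMatches(field, value) {
  return field === '*' || Number(field) === value;
}

function cronMatches(expression, date) {
  const [minute, hour, dayOfMonth, month, dayOfWeek] = expression.trim().split(/\s+/);

  const timeOk =
    fieldMatches(minute, date.getMinutes()) &&
    fieldMatches(hour, date.getHours()) &&
    fieldMatches(month, date.getMonth() + 1); // JS months are 0-based

  const domOk = fieldMatches(dayOfMonth, date.getDate());
  const dowOk = fieldMatches(dayOfWeek, date.getDay()); // 0 = Sunday, 1 = Monday

  // When BOTH day fields are restricted, cron runs the job if EITHER matches.
  const bothRestricted = dayOfMonth !== '*' && dayOfWeek !== '*';
  const dayOk = bothRestricted ? domOk || dowOk : domOk && dowOk;

  return timeOk && dayOk;
}

// Monday 8 December 2025, 00:00 local time (not the 1st of the month).
const secondMonday = new Date(2025, 11, 8, 0, 0);
console.log(cronMatches('0 0 1 12 1', secondMonday)); // true
console.log(cronMatches('0 0 1 12 *', secondMonday)); // false
```

With the day-of-week field set to `*`, only the day-of-month has to match, so the job is due once a year.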
    
  2. Hi, here is a suggestion, but it depends on which services you use. For example, on Google Cloud you can use Google Cloud Scheduler; on OpenShift or a similar platform you can use a CronJob. In the worst case, I think you will need to write a YAML deployment file for a job that is triggered through a publisher/subscriber setup:

    1. Create a subscriber on a service that Google Pub/Sub can trigger by topic. It performs your task and, once everything has run, publishes a message back to the broker (Google Pub/Sub).
    2. Then create another subscriber that deletes the files when it receives the message saying that all tasks have completed.

    I suggest this because, for a process like this, it is best practice to run it asynchronously.
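
The two-subscriber flow can be sketched locally as follows. This is not Google Pub/Sub code: Node's built-in `EventEmitter` stands in for the broker, and the topic names (`import-requested`, `import-completed`, `cleanup-completed`) and the `processFiles` / `deleteFiles` helpers are hypothetical placeholders for the real work.

```javascript
const { EventEmitter } = require('events');

// Runs the whole flow once. The EventEmitter is an in-process stand-in
// for a real message broker such as Google Pub/Sub.
function runPipeline(files) {
  const broker = new EventEmitter();
  const log = [];

  // Hypothetical placeholders for the real download/extract and delete logic.
  async function processFiles(list) {
    log.push(`processed ${list.length} files`);
  }
  async function deleteFiles(list) {
    log.push(`deleted ${list.length} files`);
  }

  // Subscriber 1: does the task, then publishes a completion message.
  broker.on('import-requested', async (message) => {
    await processFiles(message.files);
    broker.emit('import-completed', message);
  });

  // Subscriber 2: deletes the files once the completion message arrives.
  broker.on('import-completed', async (message) => {
    await deleteFiles(message.files);
    broker.emit('cleanup-completed', message);
  });

  // Publish the first message and resolve when cleanup has finished.
  return new Promise((resolve) => {
    broker.once('cleanup-completed', () => resolve(log));
    broker.emit('import-requested', { files });
  });
}

runPipeline(['a.zip', 'b.zip', 'c.zip']).then((entries) => console.log(entries));
```

A real broker would also need retries and error handling, which this sketch leaves out.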

    Thanks,

  3. There are a few options on AWS:

    1. I would look into the AWS Batch service, which can run a scheduled job on an EC2 instance (virtual machine) or on Fargate (serverless container runner).

    2. Alternative #2: Use an AWS Lambda serverless function to run the Node.js script (no need to set up an EC2 instance or Fargate). Lambda functions can be triggered by EventBridge rules using cron expressions. With Lambda, you pay for the number of executions and for execution time in 1 ms increments; however, this use case could be covered by the AWS Free Tier pricing for Lambda (see AWS Free Tier).

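      • Note on Lambda limits: execution time is limited to 15 minutes and local storage to 10 GB (source: Lambda Quotas). Since the downloads alone take about 15 to 20 minutes and the largest zip file is 6.7 GB, a single invocation may not be enough for this job. Lambda CPU is allocated in proportion to the memory configuration, so you may need to increase the memory to improve execution time (see Lambda Memory Configuration).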
    3. Alternative #3: You can build a state machine using AWS Step Functions to trigger Lambda functions in steps.

      • For example, a state machine can trigger three Lambda functions in parallel, where each function downloads its corresponding .zip file from census.gov and stores it in an Amazon S3 bucket. When all functions complete, the state machine can progress to the next step and trigger a fourth function that reads the data from S3, processes it and loads it into the database. Once the data has been processed and loaded, the final step can delete the .zip files from S3 if you no longer need them. EventBridge can also be used here to start the state machine using a cron expression. You can also use Amazon SNS to publish notifications (email/SMS/HTTP endpoint) when any step fails or completes.
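
As a rough sketch of what such a state machine could look like, the snippet below builds an Amazon States Language definition as a plain JavaScript object: a Parallel state with one download branch per file, followed by a processing task and a cleanup task. The function names (`download-zip-to-s3`, `process-and-load`, `delete-zips`), the `REGION` / `ACCOUNT_ID` placeholders in the ARNs and the example.com URLs are all assumptions made for illustration.

```javascript
// Hypothetical Lambda ARNs; REGION and ACCOUNT_ID are placeholders.
const arn = (name) => `arn:aws:lambda:REGION:ACCOUNT_ID:function:${name}`;

// One branch of the Parallel state: a single Task that downloads one file.
function downloadBranch(stateName, url) {
  return {
    StartAt: stateName,
    States: {
      [stateName]: {
        Type: 'Task',
        Resource: arn('download-zip-to-s3'),
        Parameters: { url }, // becomes the input of the Lambda function
        End: true,
      },
    },
  };
}

// Parallel downloads -> process and load -> delete the zip files.
function buildStateMachine(urls) {
  return {
    Comment: 'Yearly census import (sketch)',
    StartAt: 'DownloadAll',
    States: {
      DownloadAll: {
        Type: 'Parallel',
        Branches: urls.map((url, i) => downloadBranch(`Download${i + 1}`, url)),
        Next: 'ProcessAndLoad',
      },
      ProcessAndLoad: { Type: 'Task', Resource: arn('process-and-load'), Next: 'DeleteZips' },
      DeleteZips: { Type: 'Task', Resource: arn('delete-zips'), End: true },
    },
  };
}

// Placeholder URLs; the real census.gov paths are not given in the question.
const definition = buildStateMachine([
  'https://example.com/first.zip',
  'https://example.com/second.zip',
  'https://example.com/third.zip',
]);
console.log(JSON.stringify(definition, null, 2));
```

The printed JSON is the kind of document Step Functions expects as a state machine definition. Keep in mind that each download branch is still a Lambda invocation, so the 15-minute limit applies to every branch.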
  4. The simple solution is to Schedule AWS Lambda Functions Using CloudWatch Events.

    So, you will have an AWS Lambda function that downloads the .zip files to an S3 bucket, unzips them and loads the extracted data into the database. After that, the same function can empty the S3 bucket.

    This function will be triggered yearly by CloudWatch Events.

    For more information, check out this tutorial here
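
A minimal sketch of such a function is shown below, under a few assumptions. The schedule expression uses the six-field cron format of CloudWatch Events (now Amazon EventBridge) and picks 1 December at 00:00 UTC purely as an example. The step names and their empty bodies are hypothetical placeholders. The handler checks `context.getRemainingTimeInMillis()` before each step, because Lambda stops an invocation after at most 15 minutes and the question mentions that the downloads alone take 15 to 20 minutes.

```javascript
// Example schedule for the rule: 00:00 UTC on 1 December, every year.
// Field order: minutes hours day-of-month month day-of-week year.
const YEARLY_SCHEDULE = 'cron(0 0 1 12 ? *)';

// Stop starting new steps when less than this much time is left (assumed margin).
const SAFETY_MARGIN_MS = 60000;

// Builds a Lambda handler that runs the given steps in order and reports
// how far it got if the invocation is about to run out of time.
function makeHandler(steps) {
  return async function handler(event, context) {
    const completed = [];
    for (const [name, run] of steps) {
      if (context.getRemainingTimeInMillis() < SAFETY_MARGIN_MS) {
        return { status: 'out-of-time', completed };
      }
      await run(event);
      completed.push(name);
    }
    return { status: 'done', completed };
  };
}

// Hypothetical placeholder steps; the real ones would call S3 and the database.
const yearlySteps = [
  ['download', async () => {}],
  ['unzip', async () => {}],
  ['load', async () => {}],
  ['cleanup', async () => {}],
];

exports.handler = makeHandler(yearlySteps);
```

If the handler regularly returns `out-of-time`, that is a sign the job needs to be split across several invocations or moved to a service without the 15-minute limit.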
