I need access to a 50 GB dataset I found on Kaggle (https://www.kaggle.com/). The problem is that my computer's storage and my internet connection are pretty bad, so I thought it would be a good idea to save that dataset to an S3 bucket with a Lambda function, without downloading it to my computer first. Kaggle has an open API that can be installed with pip install kaggle
and then kaggle datasets download theRequiredDataset
to get your dataset, but I don't know how to run this in a Lambda function. Does anyone know how to do that? Is this a good approach? Do you have any other ideas or suggestions?
Question posted in Amazon Web Services
The official Amazon Web Services documentation can be found here.
2 Answers
It's probably easiest to launch an Amazon EC2 instance (a t3.nano is fine) and install that toolset. Then, download the data to the EC2 instance and upload it to an Amazon S3 bucket.
You should assign an IAM Role to the instance with permission to access your S3 bucket.
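The download-then-upload step on the EC2 instance could be sketched in Python like this. It's a rough sketch, not tested end to end: the kaggle and boto3 packages, the /data download directory, and the "kaggle/..." object key prefix are all assumptions; credentials come from ~/.kaggle/kaggle.json and the instance's IAM role.

```python
# Sketch: mirror a Kaggle dataset to S3 from an EC2 instance.
# Assumes `pip install kaggle boto3`, Kaggle credentials in
# ~/.kaggle/kaggle.json, and an IAM role granting s3:PutObject.
import os

def archive_path(dataset: str, download_dir: str = "/data") -> str:
    """Path where the Kaggle API leaves the zipped dataset for 'owner/name'."""
    return os.path.join(download_dir, dataset.split("/")[-1] + ".zip")

def mirror_to_s3(dataset: str, bucket: str, download_dir: str = "/data") -> str:
    import boto3
    import kaggle  # authenticates from ~/.kaggle/kaggle.json on import

    # Download the dataset archive to the local disk of the EC2 instance...
    kaggle.api.dataset_download_files(dataset, path=download_dir, unzip=False)

    # ...then push it to S3 under a flattened key (prefix is an assumption).
    key = f"kaggle/{dataset.replace('/', '_')}.zip"
    boto3.client("s3").upload_file(archive_path(dataset, download_dir), bucket, key)
    return key
```

Make sure the instance's EBS volume is large enough to hold the 50 GB archive before running this.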
EC2 seems most straightforward. If you want to go serverless, you would need to create a specific Lambda layer in order to be able to use the kaggle package. I have not tried this, but it should work; see:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-layers.html
https://repost.aws/knowledge-center/lambda-import-module-error-python
https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
code example:
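A minimal handler sketch, assuming a layer that bundles the kaggle package and credentials supplied via the KAGGLE_USERNAME / KAGGLE_KEY environment variables; the event fields ("dataset", "bucket") and the S3 key scheme are placeholders I made up:

```python
# Hypothetical Lambda handler -- requires a custom layer providing the
# "kaggle" package and KAGGLE_USERNAME / KAGGLE_KEY env vars.
import os

def build_s3_key(dataset: str) -> str:
    """Derive an S3 object key from a Kaggle dataset slug like 'owner/name'."""
    return f"kaggle/{dataset.replace('/', '_')}.zip"

def handler(event, context):
    # Lambda can only write under /tmp (512 MB by default, at most 10 GB
    # with extra ephemeral storage) -- a 50 GB dataset will NOT fit, which
    # is why the EC2 approach in the other answer is more practical here.
    os.environ["KAGGLE_CONFIG_DIR"] = "/tmp"

    import boto3   # included in the Lambda Python runtime
    import kaggle  # provided by the custom layer (assumption)

    dataset = event["dataset"]  # e.g. "owner/name"
    kaggle.api.dataset_download_files(dataset, path="/tmp", unzip=False)

    zip_name = dataset.split("/")[-1] + ".zip"
    key = build_s3_key(dataset)
    boto3.client("s3").upload_file(f"/tmp/{zip_name}", event["bucket"], key)
    return {"uploaded": key}
```

Given the /tmp size limit and the 15-minute Lambda timeout, this only makes sense for datasets far smaller than 50 GB.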