
I need access to a 50 GB dataset I found on Kaggle (https://www.kaggle.com/). The problem is that my computer's memory and my internet connection are pretty bad, so I thought it would be a good idea to save that dataset to an S3 bucket with a Lambda function, without downloading it to my computer first. Kaggle has an open API: you can pip install kaggle and then run kaggle datasets download theRequiredDataset to get your dataset, but I don't know how to run this in a Lambda function. Does anyone know how to do that? Is this a good approach? Do you have any other ideas or suggestions?

Answers


  1. It’s probably easiest to launch an Amazon EC2 instance (a t3.nano is fine) and install the Kaggle CLI there.

    Then, download the data to the EC2 instance and upload it to an Amazon S3 bucket.

    You should assign an IAM Role to the instance with permission to access your S3 bucket.
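
    The steps above can be sketched as a small script to run on the instance. The dataset slug and bucket name below are placeholders, and the actual transfer only runs when RUN_TRANSFER is set, since it needs the kaggle CLI plus AWS credentials (supplied by the instance's IAM role):

    ```shell
    #!/bin/sh
    # Sketch of the EC2 route; DATASET and BUCKET are placeholder values.
    set -eu

    DATASET="uciml/iris"             # hypothetical Kaggle dataset slug
    BUCKET="your-s3-bucket-name"     # hypothetical bucket name
    ZIP="$(basename "$DATASET").zip" # kaggle saves the archive as <slug>.zip
    S3_URI="s3://$BUCKET/kaggle-data/$ZIP"

    # The transfer needs network access and credentials, so gate it:
    if [ -n "${RUN_TRANSFER:-}" ]; then
        pip install kaggle
        kaggle datasets download -d "$DATASET" -p /data
        aws s3 cp "/data/$ZIP" "$S3_URI"
    fi

    echo "$S3_URI"
    ```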

  2. EC2 seems most straightforward. If you want to go serverless, you will need to create a Lambda layer in order to use the kaggle package. I have not tried this, but it should work; see:

    • Create a Lambda layer so the kaggle package can be imported:

    https://docs.aws.amazon.com/lambda/latest/dg/invocation-layers.html

    https://repost.aws/knowledge-center/lambda-import-module-error-python

    • AWS Lambda environment variables:

    https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
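
    Building the layer zip from the first bullet can be sketched like this. The directory name is a placeholder; the layout with a top-level python/ folder is what Lambda expects for Python layers, and the install/zip steps are gated because they need network access and the zip tool:

    ```shell
    #!/bin/sh
    set -eu

    LAYER_DIR="kaggle-layer"   # hypothetical working directory
    mkdir -p "$LAYER_DIR/python"

    # Install the package into python/ and zip it up, gated as above:
    if [ -n "${BUILD_LAYER:-}" ]; then
        pip install kaggle -t "$LAYER_DIR/python"
        (cd "$LAYER_DIR" && zip -r ../kaggle-layer.zip python)
    fi

    echo "$LAYER_DIR/python"
    ```

    Then publish kaggle-layer.zip as a layer and attach it to the function.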

    code example:

    # Set KAGGLE_USERNAME and KAGGLE_KEY as Lambda environment variables;
    # the kaggle client reads them automatically when authenticating.

    import boto3
    from kaggle.api.kaggle_api_extended import KaggleApi

    def lambda_handler(event, context):
        # Authenticate against the Kaggle API using the env vars above
        api = KaggleApi()
        api.authenticate()

        # Download the file from Kaggle into Lambda's writable /tmp directory
        api.dataset_download_file(dataset='uciml/iris', file_name='Iris.csv', path='/tmp')

        # Upload the file to S3
        s3 = boto3.client('s3')
        bucket_name = 'your-s3-bucket-name'
        file_name = 'Iris.csv'
        object_key = 'kaggle-data/' + file_name
        s3.upload_file('/tmp/' + file_name, bucket_name, object_key)

        print(f'Successfully downloaded and uploaded {file_name} to {bucket_name}/{object_key}')
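
    One caveat for a 50 GB dataset: Lambda's /tmp is 512 MB by default and can only be raised to 10 GB, so downloading to disk first will not work at that size. A sketch of streaming the download straight into S3 instead, using boto3's upload_fileobj — the URL, bucket, and key prefix here are placeholders, and resolving an authenticated download URL from Kaggle is left as an assumption:

    ```python
    from urllib.request import urlopen

    def object_key(file_name, prefix='kaggle-data'):
        # Build the S3 key the same way as the handler above
        return prefix + '/' + file_name

    def stream_to_s3(url, bucket, key):
        # Stream an HTTP response body straight into S3 without touching /tmp.
        # upload_fileobj reads the stream in chunks and switches to multipart
        # uploads for large objects, so nothing is buffered to disk.
        import boto3  # imported here so the pure helper above has no dependencies
        s3 = boto3.client('s3')
        with urlopen(url) as body:
            s3.upload_fileobj(body, bucket, key)

    # Usage (hypothetical URL -- Kaggle normally requires an authenticated request):
    # stream_to_s3('https://example.com/archive.zip', 'your-s3-bucket-name',
    #              object_key('archive.zip'))
    ```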
    