
I need access to a 50 GB dataset I found on Kaggle (https://www.kaggle.com/). The problem is that my computer's memory and my internet connection are pretty bad, so I thought it would be a good idea to save that dataset to an S3 bucket with a Lambda function, without downloading it to my computer first. Kaggle has an open API: you can pip install kaggle and then run kaggle datasets download theRequiredDataset to get your dataset, but I don't know how to run this in a Lambda function. Does anyone know how to do that? Is this a good approach? Do you have any other ideas or suggestions?

Answers


  1. It’s probably easiest to launch an Amazon EC2 instance (a t3.nano is fine) and install the Kaggle CLI there.

    Then, download the data to the EC2 instance and upload it to an Amazon S3 bucket.

    You should assign an IAM Role to the instance with permission to access your S3 bucket.
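
    The steps above can be sketched as a small script to run on the instance. The dataset slug and bucket name below are placeholders, and the actual transfer only runs when RUN_TRANSFER is set, since it needs the kaggle CLI plus AWS credentials (supplied by the instance's IAM role):

    ```shell
    #!/bin/sh
    # Sketch of the EC2 route; DATASET and BUCKET are placeholder values.
    set -eu

    DATASET="uciml/iris"             # hypothetical Kaggle dataset slug
    BUCKET="your-s3-bucket-name"     # hypothetical bucket name
    ZIP="$(basename "$DATASET").zip" # kaggle saves the archive as <slug>.zip
    S3_URI="s3://$BUCKET/kaggle-data/$ZIP"

    # The transfer needs network access and credentials, so gate it:
    if [ -n "${RUN_TRANSFER:-}" ]; then
        pip install kaggle
        kaggle datasets download -d "$DATASET" -p /data
        aws s3 cp "/data/$ZIP" "$S3_URI"
    fi

    echo "$S3_URI"
    ```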

  2. EC2 seems most straightforward. If you want to go serverless, you will need to create a Lambda layer in order to use the kaggle package. I have not tried this, but it should work; see:

    • Create a Lambda layer so the kaggle package can be imported:

    https://docs.aws.amazon.com/lambda/latest/dg/invocation-layers.html

    https://repost.aws/knowledge-center/lambda-import-module-error-python

    • AWS Lambda environment variables:

    https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
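
    Building the layer zip from the first bullet can be sketched like this. The directory name is a placeholder; the layout with a top-level python/ folder is what Lambda expects for Python layers, and the install/zip steps are gated because they need network access and the zip tool:

    ```shell
    #!/bin/sh
    set -eu

    LAYER_DIR="kaggle-layer"   # hypothetical working directory
    mkdir -p "$LAYER_DIR/python"

    # Install the package into python/ and zip it up, gated as above:
    if [ -n "${BUILD_LAYER:-}" ]; then
        pip install kaggle -t "$LAYER_DIR/python"
        (cd "$LAYER_DIR" && zip -r ../kaggle-layer.zip python)
    fi

    echo "$LAYER_DIR/python"
    ```

    Then publish kaggle-layer.zip as a layer and attach it to the function.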

    code example:

    # Set KAGGLE_USERNAME and KAGGLE_KEY as Lambda environment variables;
    # the kaggle client reads them automatically when authenticating.

    import boto3
    from kaggle.api.kaggle_api_extended import KaggleApi

    def lambda_handler(event, context):
        # Authenticate against the Kaggle API using the env vars above
        api = KaggleApi()
        api.authenticate()

        # Download the file from Kaggle into Lambda's writable /tmp directory
        api.dataset_download_file(dataset='uciml/iris', file_name='Iris.csv', path='/tmp')

        # Upload the file to S3
        s3 = boto3.client('s3')
        bucket_name = 'your-s3-bucket-name'
        file_name = 'Iris.csv'
        object_key = 'kaggle-data/' + file_name
        s3.upload_file('/tmp/' + file_name, bucket_name, object_key)

        print(f'Successfully downloaded and uploaded {file_name} to {bucket_name}/{object_key}')
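
    One caveat for a 50 GB dataset: Lambda's /tmp is 512 MB by default and can only be raised to 10 GB, so downloading to disk first will not work at that size. A sketch of streaming the download straight into S3 instead, using boto3's upload_fileobj — the URL, bucket, and key prefix here are placeholders, and resolving an authenticated download URL from Kaggle is left as an assumption:

    ```python
    from urllib.request import urlopen

    def object_key(file_name, prefix='kaggle-data'):
        # Build the S3 key the same way as the handler above
        return prefix + '/' + file_name

    def stream_to_s3(url, bucket, key):
        # Stream an HTTP response body straight into S3 without touching /tmp.
        # upload_fileobj reads the stream in chunks and switches to multipart
        # uploads for large objects, so nothing is buffered to disk.
        import boto3  # imported here so the pure helper above has no dependencies
        s3 = boto3.client('s3')
        with urlopen(url) as body:
            s3.upload_fileobj(body, bucket, key)

    # Usage (hypothetical URL -- Kaggle normally requires an authenticated request):
    # stream_to_s3('https://example.com/archive.zip', 'your-s3-bucket-name',
    #              object_key('archive.zip'))
    ```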
    