
I’m trying to download some files from a public S3 bucket as part of the Google Data Analytics course. However, I am not getting the links returned in my request. I’m not sure if I need to use boto3 or a different package, since it’s a public URL with visible links. Reading the boto3 docs, I am not 100% sure how I would list the zip files that are linked on the page. Sorry, I’m fairly new at this.

So far, this is what I’ve gotten:

    import requests
    from bs4 import BeautifulSoup

    # fetch the bucket's index page
    r = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')  # specify a parser explicitly

    # collect the href of every <a> tag on the page
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))

The request to the URL returns a 200; however, the list of hrefs collected from the ‘a’ tags comes up empty. I am trying to get all of the hrefs so I can loop over them and download each zip file with urllib.request, appending /filename to the base URL.
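For reference, the download loop I have in mind looks something like this (just a sketch, assuming links actually gets populated with the zip file names):

    import urllib.request

    base_url = 'https://divvy-tripdata.s3.amazonaws.com'
    for filename in links:  # each href is assumed to be a zip file name
        urllib.request.urlretrieve(f'{base_url}/{filename}', filename)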

Any help would be greatly appreciated and thank you in advance!

3 Answers


  1. Chosen as BEST ANSWER

Thank you for your comments. While the AWS CLI worked just fine, I wanted to bake this into my Python script for future reference and ease of access. As such, I was able to figure out how to download the zip files using boto3.

This solution uses botocore, the lower-level package underneath boto3, to bypass authentication by configuring the client as UNSIGNED. I found out about this through a GitHub project called s3-key-listener, which will "List all keys in any public Amazon s3 bucket, option to check if each object is public or private. Saves result as a .csv file".

    #Install boto3 (this also includes botocore); in a notebook:
    !pip install boto3

    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config
    import os #this is for joining the download directory

    def get_s3_public_data(bucket='divvy-tripdata'):
        #create the s3 client without credentials (UNSIGNED for a public bucket)
        client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

        #create a list of 'Contents' objects from the s3 bucket
        list_files = client.list_objects(Bucket=bucket)['Contents']

        os.makedirs('./data', exist_ok=True) #make sure the target folder exists
        for key in list_files:
            if key['Key'].endswith('.zip'): #skip anything that is not a zip file
                print(f'downloading... {key["Key"]}') #print file name
                client.download_file(
                    Bucket=bucket, #bucket name
                    Key=key['Key'], #key is the file name
                    Filename=os.path.join('./data', key['Key']) #local file path
                )

    get_s3_public_data()
    

    This connects to the s3 bucket and fetches the zip files for me. Hope this helps anyone else dealing with a similar issue.
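    One caveat: list_objects returns at most 1,000 keys per call. If the bucket ever grows past that limit, boto3's paginator handles it. A minimal sketch, reusing the same unsigned client from above:

    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='divvy-tripdata'):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.zip'):
                print(obj['Key']) #the same download_file call as above would go here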


  2. It would appear that your goal is to download files from a public Amazon S3 bucket.

    The easiest approach is to use the AWS Command-Line Interface (CLI). Since the bucket is public, you do not require any credentials:

    aws s3 --no-sign-request sync s3://divvy-tripdata .

    Sample output:
    download: s3://divvy-tripdata/202006-divvy-tripdata.zip to ./202006-divvy-tripdata.zip
    download: s3://divvy-tripdata/202012-divvy-tripdata.zip to ./202012-divvy-tripdata.zip
    download: s3://divvy-tripdata/202007-divvy-tripdata.zip to ./202007-divvy-tripdata.zip
    download: s3://divvy-tripdata/202010-divvy-tripdata.zip to ./202010-divvy-tripdata.zip
    download: s3://divvy-tripdata/202011-divvy-tripdata.zip to ./202011-divvy-tripdata.zip
    etc
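
    If you only want the zip files, the CLI's standard --exclude/--include filters can narrow the sync:

    aws s3 --no-sign-request sync s3://divvy-tripdata . --exclude "*" --include "*.zip"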
    
  3. This is the solution for those looking to download the Google Data Analytics Case Study 1 files from divvy-tripdata.s3.amazonaws.com/index.html who lack programming expertise but don’t want to download the case files one by one.
    First, in the terminal (assuming you use a Mac), install the Amazon command line interface (AWS CLI). I spent hours trying to download via Python and BeautifulSoup and failed, so this is the easy way instead.

    In the terminal, run this:

    sudo easy_install awscli
    

    or (which worked better for me)

    sudo pip install awscli
    

    Either of the above will install the command line interface; then a single command downloads all the zip files into the current folder on your hard drive.

    Run this in the terminal:

    aws s3 --no-sign-request sync s3://divvy-tripdata .
    

    You can play with the destination folder, of course.
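    For example, to sync into a (hypothetical) ~/divvy-data folder instead of the current directory:

    aws s3 --no-sign-request sync s3://divvy-tripdata ~/divvy-data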

    You should see something like this in the terminal as a result:

    download: s3://divvy-tripdata/202004-divvy-tripdata.zip to ./202004-divvy-tripdata.zip
    download: s3://divvy-tripdata/202005-divvy-tripdata.zip to ./202005-divvy-tripdata.zip
    download: s3://divvy-tripdata/202007-divvy-tripdata.zip to ./202007-divvy-tripdata.zip
    download: s3://divvy-tripdata/202006-divvy-tripdata.zip to ./202006-divvy-tripdata.zip
    download: s3://divvy-tripdata/202011-divvy-tripdata.zip to ./202011-divvy-tripdata.zip
    download: s3://divvy-tripdata/202102-divvy-tripdata.zip to ./202102-divvy-tripdata.zip
    download: s3://divvy-tripdata/202009-divvy-tripdata.zip to ./202009-divvy-tripdata.zip
    