I’m trying to download some files from a public S3 bucket as part of the Google Analytics course. However, I am not getting the links returned in my request. I’m not sure if I need to use boto3 or a different API package, since it’s a public URL with visible links. Reading the boto3 docs, I am not 100% sure how I would list the zip files that are listed in the page’s links. Sorry, I’m fairly new at this.
So far, this is what I’ve gotten:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://divvy-tripdata.s3.amazonaws.com/index.html')
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # pass an explicit parser
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
The request to the URL returns a 200; however, the href links[] collected from the ‘a’ tags are coming up empty. I am trying to get all of the hrefs so I can loop over them and download the files with urllib.request, using the base URL plus /filename for each zip file.
Any help would be greatly appreciated and thank you in advance!
3 Answers
Thank you for your comments. While the AWS CLI worked just fine, I wanted to bake this into my Python script for ease of access later. As such, I was able to figure out how to download the zip files using boto3.
This solution uses boto3’s lower-level library, botocore, to bypass authentication with the 'UNSIGNED' signature configuration. I found out about this through another GitHub project called s3-key-listener, which will "List all keys in any public Amazon s3 bucket, option to check if each object is public or private. Saves result as a .csv file".
This connects to the s3 bucket and fetches the zip files for me. Hope this helps anyone else dealing with a similar issue.
It would appear that your goal is to download files from a public Amazon S3 bucket.
The easiest approach is to use the AWS Command-Line Interface (CLI). Since the bucket is public, you do not require any credentials:
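The command itself was not captured above; for this bucket it would look something like the following, using the CLI’s `--no-sign-request` flag to make anonymous requests:

```shell
# Copy every object from the public bucket into the current directory.
# --no-sign-request skips credential signing, so no AWS account is needed.
aws s3 sync s3://divvy-tripdata . --no-sign-request
```

`sync` also skips files you have already downloaded, which is handy if the transfer is interrupted.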
This is the solution for those working through the Google Data Analytics Case Study 1 who want the files from divvy-tripdata.s3.amazonaws.com/index.html but lack programming expertise and don’t want to download the Case files one by one.
First, install the AWS Command Line Interface via the terminal (assuming you use a Mac). I spent hours trying to download the files with Python and BeautifulSoup and failed, so this is the easy route instead:
In the terminal, run this:
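The command was not captured above; on a Mac with Homebrew installed, it would typically be:

```shell
# Install the AWS CLI via Homebrew (assumes brew is already set up)
brew install awscli
```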
or (which worked better for me)
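This alternative was also not captured; AWS’s official macOS installer, which does not depend on Homebrew, would be:

```shell
# Download and run the official AWS CLI v2 package installer for macOS
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
```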
Either of the above will install the command line interface, and then a simple command downloads all zip files into the current folder on the hard drive.
Run in the terminal
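The download command was not captured above; given the bucket in question, it would be something like:

```shell
# Recursively copy all objects from the public bucket into the current folder;
# --no-sign-request makes anonymous (credential-free) requests.
aws s3 cp s3://divvy-tripdata . --recursive --no-sign-request
```

Replace the trailing `.` with any path to change where the files land.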
You can play with the destination folder, of course.
You should see this in the Terminal as a result: