I’m trying to collect some data from a game box score like this: https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/
The data is stored in a JSON file (‘data.json’), which I managed to download from the Network tab in Chrome DevTools. I was then able to parse it and get the data I need.
Now I’m trying to pull the JSON directly from the URL (without downloading the file) so I can automate data gathering across multiple pages of the same kind.
I’m no expert in making requests to sites, especially when they are not static and the content is loaded dynamically through JSON/JavaScript, so forgive any bad phrasing of the concepts.
This is what I’ve tried so far:
import json
from urllib.request import urlopen

url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/"
response = urlopen(url)
data = json.loads(response.read())
# JSON parsing and data gathering from data
which gives the error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I then tried appending ‘data.json’ to the URL:
url = "https://fibalivestats.dcd.shared.geniussports.com/u/LEGBF/2213178/data.json"
response = urlopen(url)
data = json.loads(response.read())
#json parsing and data gathering from data
which produces:
urllib.error.HTTPError: HTTP Error 403: Forbidden
From what I understand, in the first case the request just comes up empty, while in the second case it is not able to open the JSON file.
I noticed that if I haven’t manually opened the Chrome DevTools page, the https://…/data.json page returns a 403 error; however, data.json loads correctly after I reload the page with Ctrl+R while the Network tab is open.
My understanding is that I need to perform some other action beyond requests.get() or the equivalent from urllib in order to pull down the JSON file.
Could someone point me in the right direction?
2 Answers
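With the correct URL, your Python script loads the JSON without any extra steps. The confusing part is that the wrong URL returns a 403 rather than a 404.
The 403 is due to the permissions on the S3 bucket, as described in this blog post and in more detail in the AWS docs: when the requester is not allowed to list the bucket, S3 answers a request for a nonexistent object with 403 instead of 404.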
If you look at the headers for the failed request, it reports that it is served by S3.
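As a rough sketch of how to check this from Python rather than from the browser, the snippet below catches the HTTPError and reports its status code together with the Server response header. The helper names describe_http_error and probe are made up for illustration, and reading the Server header specifically is an assumption about where the S3 hint shows up.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def describe_http_error(err):
    """Summarise the status code and Server header of a failed request."""
    headers = err.headers
    # Assumption: the serving backend identifies itself in the Server header.
    server = headers.get("Server", "unknown") if headers is not None else "unknown"
    return f"HTTP {err.code} from server: {server}"


def probe(url):
    """Request url and return a one-line summary of what came back."""
    try:
        with urlopen(url) as response:
            return f"HTTP {response.status} OK"
    except HTTPError as err:
        return describe_http_error(err)
```

Calling probe() on the failing .../u/LEGBF/2213178/data.json URL from the question should show the 403 along with whatever the Server header contains.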
If you look at the Chrome developer tools while the HTML page loads, the URL for the data is actually:
https://fibalivestats.dcd.shared.geniussports.com/data/2213178/data.json
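To automate this for several games, something along these lines might work. It is only a sketch: it assumes that every game uses the same /data/<game id>/data.json pattern as the one above, and that the game ID is the last numeric segment of the page URL. The function names are illustrative, not part of any library.

```python
import json
import re
from urllib.request import urlopen

# Assumption: all games follow the pattern of the URL seen in DevTools.
DATA_URL_TEMPLATE = (
    "https://fibalivestats.dcd.shared.geniussports.com/data/{game_id}/data.json"
)


def game_id_from_page_url(page_url):
    """Return the game ID, assumed to be the trailing numeric path segment."""
    match = re.search(r"/(\d+)/?$", page_url)
    if match is None:
        raise ValueError(f"no game ID found in {page_url!r}")
    return match.group(1)


def data_url_for(page_url):
    """Build the data.json URL that corresponds to a box score page URL."""
    return DATA_URL_TEMPLATE.format(game_id=game_id_from_page_url(page_url))


def fetch_game_data(page_url):
    """Download and parse data.json for the given box score page."""
    with urlopen(data_url_for(page_url)) as response:
        return json.loads(response.read())
```

You would then call fetch_game_data() once per page URL in your list and run your existing parsing code on each returned object.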
You can use Selenium. For example, I scraped the players' names; you can extend the code to extract whatever else you need.