I am trying to read window.appCache from a Glassdoor reviews page.
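import requests
from bs4 import BeautifulSoup
url = "https://www.glassdoor.com/Reviews/Alteryx-Reviews-E351220.htm"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(html.content, 'html.parser')
# The first <script> tag holds the window.appCache assignment.
text = soup.findAll("script")[0].text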
This isolates the dict I need; however, when I call json.loads() on it, I get the following error:
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I checked the type of text and it is str.
When I print text to a file, it looks something like this (just a snippet, as the output is about 5000 lines):
window.appCache={"appName":"reviews","appVersion":"7.14.12","initialState"
{"surveyEndpoint":"https:u002Fu002Femployee-pulse-survey-b2c.us-east-1.prod.jagundi.com",
"i18nStrings":{"_":"JSON MESSAGE BUNDLE - do not remove",
"eiHeader.seeAllPhotos":"
See All Photos","eiHeader.viewJobs":"View Jobs",
"eiHeader.bptw.description":"This employer is a winner of the [year] Best Places to Work award.
Winners were determined by the people who know these companies best...
I am only concerned with the "reviews":[ field that is buried about halfway through the data, but I can't seem to parse the string into JSON and retrieve what I need.
3 Answers
One solution is to extract the required data with the re and json modules.
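The code for this answer did not survive on the page, so what follows is only a minimal sketch of one way to combine re and json, not necessarily the answerer's approach. It assumes the array after "reviews": is itself valid JSON (no undefined or functions inside it), which is not confirmed for the real page. The helper name extract_json_value and the sample string are made up for illustration.

```python
import json
import re

def extract_json_value(js_source, key):
    """Decode the JSON value that follows "key": inside a larger JS string."""
    # Locate the key and skip past the colon and any whitespace after it.
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), js_source)
    if match is None:
        raise KeyError(key)
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows, so the rest of the script is never parsed.
    value, _end = json.JSONDecoder().raw_decode(js_source, match.end())
    return value

# Small stand-in for the script text (assumption: not real Glassdoor data).
sample = ('window.appCache={"init":undefined,'
          '"reviews":[{"id":1,"pros":"Good [team]"}],"cb":function(){}}')
reviews = extract_json_value(sample, "reviews")
```

This takes the first occurrence of the key only. If the page contains several "reviews" keys, re.finditer could be used to try each match in turn.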
Well, json.loads() should take a string that contains a JSON document. However, the value of text is not valid JSON because of the window.appCache= at the beginning.
And it's not just that: I tried slicing text to exclude the window.appCache= part, and it still gave me an error,
so I checked the value of text[68110:], and it turns out that it was complaining because text is indeed not a valid JSON document. The result of text[68110:] is a JavaScript object, but not a valid JSON object. JSON values cannot be a function, a date, or undefined.
As you can see, text has undefined and a function as values for some fields.
If you want the value of a specific field ("reviews", as you mentioned, for example), I recommend parsing the string manually, maybe using regular expressions or something like that.
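To make the two failure modes concrete, here is a small sketch using toy strings rather than the real page content. The helper decode_error_position is hypothetical; it only reports the pos attribute of the JSONDecodeError, which plays the same role as the 68110 offset mentioned above.

```python
import json

def decode_error_position(candidate):
    """Return the character offset where json.loads fails, or None if it parses."""
    try:
        json.loads(candidate)
    except json.JSONDecodeError as err:
        return err.pos
    return None

# Toy strings standing in for the page's script text (assumptions).
with_prefix = 'window.appCache={"a":1}'   # fails at the very first character
with_undefined = '{"a":undefined}'        # fails where the JS-only value starts
valid = '{"a":null}'                      # null is the JSON-legal counterpart

prefix_pos = decode_error_position(with_prefix)
undefined_pos = decode_error_position(with_undefined)
valid_pos = decode_error_position(valid)
```

Slicing the original string from the reported offset, as in text[68110:], shows which value the decoder could not accept.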
bs4 only parses HTML, not JavaScript (nor CSS). As some of the comments have mentioned, a common approach is to split text at = and use json.loads to parse window.appCache, but in this case that will still raise a JSONDecodeError, because window.appCache contains JS functions and JS primitive values (like undefined) that JSON does not allow.
I have a function, findObj_inJS, which uses slimit to parse a string containing JavaScript code and extract an object/variable from it. For example, findObj_inJS(text, '"reviews"') will return a single match, and findObj_inJS(text, '"reviews"', findAll=True) will return all matches. (I think you probably want findObj_inJS(text, '"reviews"', findAll=True)[1].)
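The source of findObj_inJS is not shown here, so the following is not that function and does not use slimit. It is a rough, dependency-free sketch of a different technique: string-aware bracket matching that returns the raw JavaScript source of the array or object after a key. The name slice_js_value is hypothetical, and the sketch assumes no whitespace after the colon and does not handle comments, regex literals, or template literals.

```python
def slice_js_value(js_source, key):
    """Return the raw source of the bracketed value that follows "key":."""
    # '"' + key + '"' + ':' is len(key) + 3 characters long.
    start = js_source.index('"%s":' % key) + len(key) + 3
    if js_source[start] not in "[{":
        raise ValueError("value is not an array or object literal")
    depth = 0
    quote = None  # quote character of the string we are inside, if any
    i = start
    while i < len(js_source):
        ch = js_source[i]
        if quote:
            if ch == "\\":
                i += 1          # skip the escaped character
            elif ch == quote:
                quote = None    # end of the string literal
        elif ch in "\"'":
            quote = ch
        elif ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
            if depth == 0:
                return js_source[start:i + 1]
        i += 1
    raise ValueError("unbalanced brackets")

# Toy input (assumption): a bracket inside a string and a JS-only value.
sample = ('window.appCache={"reviews":[{"id":1,"text":"a ] in a string",'
          '"when":undefined}],"cb":function(){}}')
raw = slice_js_value(sample, "reviews")
```

The result is still JavaScript source, not a Python object, so any undefined or function values inside it would need separate handling before json.loads could accept it.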