I’m trying to perform a job search on Glassdoor and crawl the results. Here’s the first page of an example search. The second page of results is obtained through a fetch request that is triggered by clicking the Next button.
Here’s the curl sent by the browser after clicking the next button:
curl 'https://www.glassdoor.com/graph'
-H 'authority: www.glassdoor.com'
-H 'accept: */*'
-H 'accept-language: en-US,en;q=0.9,ar;q=0.8'
-H 'apollographql-client-name: job-search'
-H 'apollographql-client-version: 0.27.22'
-H 'content-type: application/json'
-H 'cookie: ...'
-H 'gd-csrf-token: ...'
-H 'origin: https://www.glassdoor.com'
-H 'referer: https://www.glassdoor.com/'
-H 'sec-ch-ua: "Google Chrome";v="113", "Chromium";v="113", "Not-A.Brand";v="24"'
-H 'sec-ch-ua-mobile: ?0'
-H 'sec-ch-ua-platform: "macOS"'
-H 'sec-fetch-dest: empty'
-H 'sec-fetch-mode: cors'
-H 'sec-fetch-site: same-origin'
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
--data-raw '{"operationName":"JobSearchQuery","variables":{"searchParams":{"keyword":"python","numPerPage":30,"searchType":"SR","pageNumber":2,"pageCursor":"AB4AAYEAHgAAAAAAAAAAAAAAAgMLZPsAPwEBAQ0EQ8ADTE5cfDqxexbrGvagYDN+iOmE8v7CeRN5VA9AB6c4BP2LZ3F1cbEmTsNtyle+DW0dPbIa5HyzwQAA","filterParams":[{"filterKey":"includeNoSalaryJobs","values":"true"},{"filterKey":"sc.keyword","values":"python"},{"filterKey":"locT","values":""},{"filterKey":"locId","values":""}],"seoUrl":false}},"query":"query JobSearchQuery($searchParams: SearchParams) {n jobListings(contextHolder: {searchParams: $searchParams}) {n ...SearchFragmentn __typenamen }n}nnfragment SearchFragment on JobListingSearchResults {n adOrderJobLinkImpressionTrackingn totalJobsCountn filterOptionsn companiesLinkn searchQueryGuidn indeedCtkn jobSearchTrackingKeyn paginationCursors {n pageNumbern cursorn __typenamen }n searchResultsMetadata {n cityPages {n cityBlurbn cityPagesStats {n bestCitiesForJobsRankn meanBaseSalaryn populationn unemploymentRaten __typenamen }n displayNamen employmentResources {n addressLine1n addressLine2n cityNamen namen phoneNumbern staten zipCoden __typenamen }n heroImagen isLandingExperiencen locationIdn numJobOpeningsn popularSearches {n textn urln __typenamen }n __typenamen }n copyrightYearn footerVO {n countryMenu {n childNavigationLinks {n idn linkn textKeyn __typenamen }n idn linkn textKeyn __typenamen }n __typenamen }n helpCenterDomainn helpCenterLocalen isPotentialBotn jobAlert {n jobAlertExistsn promptedOnJobsSearchn promptingForJobClicksn __typenamen }n jobSearchQueryn loggedInn searchCriteria {n implicitLocation {n idn localizedDisplayNamen typen __typenamen }n keywordn location {n idn localizedDisplayNamen shortNamen localizedShortNamen typen __typenamen }n __typenamen }n showMachineReadableJobsn showMissingSearchFieldTooltipn __typenamen }n companyFilterOptions {n idn shortNamen __typenamen }n pageImpressionGuidn pageSlotIdn relatedCompaniesLRPn relatedCompaniesZRPn 
relatedJobTitlesn resourceLinkn seoTableEnabledn jobListingSeoLinks {n linkItems {n positionn urln __typenamen }n __typenamen }n jobListings {n jobview {n job {n descriptionFragmentsn eolHashCoden jobReqIdn jobSourcen jobTitleIdn jobTitleTextn listingIdn __typenamen }n gdJobAttributes {n salarySourcen basePay {n p25n p75n __typenamen }n additionalPay {n p25n p75n __typenamen }n __typenamen }n jobListingAdminDetails {n adOrderIdn cpcValn importConfigIdn jobListingIdn jobSourceIdn userEligibleForAdminJobDetailsn __typenamen }n overview {n idn namen shortNamen squareLogoUrln __typenamen }n gaTrackerData {n trackingUrln jobViewDisplayTimeMillisn requiresTrackingn isIndeedJobn searchTypeCoden pageRequestGuidn isSponsoredFromJobListingHitn isSponsoredFromIndeedn __typenamen }n header {n adOrderIdn adOrderSponsorshipLeveln advertiserTypen ageInDaysn applyUrln autoLoadApplyFormn easyApplyn easyApplyMethodn employerNameFromSearchn jobLinkn jobCountryIdn jobResultTrackingKeyn locIdn locationNamen locationTypen needsCommissionn normalizedJobTitlen organicn payPercentile90n payPercentile50n payPercentile10n hourlyWagePayPercentile {n payPercentile90n payPercentile50n payPercentile10n __typenamen }n ratingn salarySourcen sponsoredn payPeriodn payCurrencyn savedJobIdn sgocIdn categoryMgocIdn urgencySignal {n labelKeyn messageKeyn normalizedCountn __typenamen }n __typenamen }n __typenamen }n __typenamen }n __typenamen}n"}'
--compressed
When I execute this directly in a terminal, even though the same request worked in the browser, I get a human verification message, which indicates that something went wrong:
<h1>Help Us Protect Glassdoor</h1>
<p>
Please help us protect Glassdoor by verifying that you're a
real person. We are sorry for the inconvenience. If you continue to see this
message, please email
</p>
This sort of work should be possible through the API, which is clearly having issues or not working, since I keep getting the message
Glassdoor will not work properly unless browser cookie support is enabled
despite cookies being enabled. So what I’m trying to do is figure out how to make the curl command work outside the browser; then I will repeat whatever works to get the subsequent pages of results.
Note: I’m aware this can easily be achieved using selenium, which I’m currently using to get the search results. I don’t like the selenium approach because it’s slow, and I’m not sure whether this can be achieved without JavaScript. If I get the curl approach to work, it can easily be converted to a Python requests approach, which would be the optimal result.
2 Answers
The website uses JavaScript to render the web page.
You can test this yourself when in doubt: in Firefox, open a tab at about:config, then double-click javascript.enabled so that it becomes false. Then refresh the web page: the expected data is missing.
To fetch the data, you need a library that is JavaScript-aware. It could be one of:
Python, JavaScript or Java + Selenium
Python + Playwright
Python + requests-HTML
JavaScript: Node.js + Puppeteer
With the last of these, you can use this script as a replacement for cURL:
We need to fake the User-Agent to be able to retrieve the data:
Then use:
As you can see, this is a specific IT domain called web scraping. I can be hired to do the job as needed and deliver the data as clean CSV, JSON, MongoDB, or anything™. This is my specialty.
Another solution would be to use Python + requests, which is sometimes sufficient even though it is not JavaScript-capable. But you have to dig into Chrome Dev Tools and try to figure out which request generates, for example, the JSON or the HTML you need.
Using Python requests (no need for JavaScript in this special case):
Then, to go to the next page, you need to send the same JSON object that the website sends, using requests, in a POST request to https://www.glassdoor.com/graph.
You can see this request in Chrome Dev Tools: Network tab.
It’s a very big JSON.
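A minimal sketch in Python of the POST request described above, not a tested solution. The URL, header names, operationName and searchParams come from the curl command in the question. The rest is assumption: the helper names (build_payload, cursor_for_page, fetch_page) are hypothetical, the query is trimmed to a few fields from the original on the assumption that the server accepts a subset, and the response is assumed to follow the usual GraphQL shape with a top-level "data" key. The cookie and gd-csrf-token values are elided in the question and have to be copied from a live browser session. The site may still answer with the verification page, in which case resp.json() will raise.

```python
GRAPH_URL = "https://www.glassdoor.com/graph"

# Trimmed version of the query from the question (assumption: a subset of
# the originally requested fields is accepted by the server).
QUERY = """
query JobSearchQuery($searchParams: SearchParams) {
  jobListings(contextHolder: {searchParams: $searchParams}) {
    paginationCursors { pageNumber cursor }
    jobListings {
      jobview {
        job { jobTitleText listingId }
        header { employerNameFromSearch locationName jobLink }
      }
    }
  }
}
"""

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
)


def build_payload(keyword, page_number, page_cursor):
    """Build the JSON body, mirroring the variables seen in the curl command."""
    return {
        "operationName": "JobSearchQuery",
        "variables": {
            "searchParams": {
                "keyword": keyword,
                "numPerPage": 30,
                "searchType": "SR",
                "pageNumber": page_number,
                "pageCursor": page_cursor,
                "filterParams": [
                    {"filterKey": "includeNoSalaryJobs", "values": "true"},
                    {"filterKey": "sc.keyword", "values": keyword},
                    {"filterKey": "locT", "values": ""},
                    {"filterKey": "locId", "values": ""},
                ],
                "seoUrl": False,
            }
        },
        "query": QUERY,
    }


def cursor_for_page(response_json, page_number):
    """Return the cursor for a page, or None if the response lacks it.

    Assumes the response shape data -> jobListings -> paginationCursors.
    """
    cursors = response_json["data"]["jobListings"]["paginationCursors"]
    for entry in cursors:
        if entry["pageNumber"] == page_number:
            return entry["cursor"]
    return None


def fetch_page(session, payload, cookie, csrf_token):
    """POST the payload using a requests.Session-like object."""
    headers = {
        "content-type": "application/json",
        "apollographql-client-name": "job-search",
        "apollographql-client-version": "0.27.22",
        "cookie": cookie,             # copied from the browser session
        "gd-csrf-token": csrf_token,  # copied from the browser session
        "origin": "https://www.glassdoor.com",
        "referer": "https://www.glassdoor.com/",
        "user-agent": USER_AGENT,
    }
    resp = session.post(GRAPH_URL, json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()  # raises if the body is HTML rather than JSON


# Usage (performs a real network request; values are placeholders):
#   import requests
#   session = requests.Session()
#   data = fetch_page(session, build_payload("python", 2, "<cursor>"),
#                     cookie="<cookie>", csrf_token="<token>")
#   next_cursor = cursor_for_page(data, 3)
```

Each response is expected to list cursors for the neighbouring pages, so the loop for subsequent results would call cursor_for_page on the previous response and pass the result to build_payload for the next request.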