cURL command copied from Chrome DevTools yields a different response in the terminal (JavaScript-rendered website)

nlblack323
May 14, 2023
142 views
0 votes
2 Answers

I’m trying to perform a job search on Glassdoor and crawl the results. Here’s the first page of an example search. The second page of results is obtained with a fetch request that is triggered by clicking the next button.

Here’s the curl command for the request the browser sends after clicking the next button:

curl 'https://www.glassdoor.com/graph' \
  -H 'authority: www.glassdoor.com' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9,ar;q=0.8' \
  -H 'apollographql-client-name: job-search' \
  -H 'apollographql-client-version: 0.27.22' \
  -H 'content-type: application/json' \
  -H 'cookie: ...' \
  -H 'gd-csrf-token: ...' \
  -H 'origin: https://www.glassdoor.com' \
  -H 'referer: https://www.glassdoor.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="113", "Chromium";v="113", "Not-A.Brand";v="24"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36' \
  --data-raw '{"operationName":"JobSearchQuery","variables":{"searchParams":{"keyword":"python","numPerPage":30,"searchType":"SR","pageNumber":2,"pageCursor":"AB4AAYEAHgAAAAAAAAAAAAAAAgMLZPsAPwEBAQ0EQ8ADTE5cfDqxexbrGvagYDN+iOmE8v7CeRN5VA9AB6c4BP2LZ3F1cbEmTsNtyle+DW0dPbIa5HyzwQAA","filterParams":[{"filterKey":"includeNoSalaryJobs","values":"true"},{"filterKey":"sc.keyword","values":"python"},{"filterKey":"locT","values":""},{"filterKey":"locId","values":""}],"seoUrl":false}},"query":"query JobSearchQuery($searchParams: SearchParams) {\n  jobListings(contextHolder: {searchParams: $searchParams}) {\n    ...SearchFragment\n    __typename\n  }\n}\n\nfragment SearchFragment on JobListingSearchResults {\n  adOrderJobLinkImpressionTracking\n  totalJobsCount\n  filterOptions\n  companiesLink\n  searchQueryGuid\n  indeedCtk\n  jobSearchTrackingKey\n  paginationCursors {\n    pageNumber\n    cursor\n    __typename\n  }\n  searchResultsMetadata {\n    cityPages {\n      cityBlurb\n      cityPagesStats {\n        bestCitiesForJobsRank\n        meanBaseSalary\n        population\n        unemploymentRate\n        __typename\n      }\n      displayName\n      employmentResources {\n        addressLine1\n        addressLine2\n        cityName\n        name\n        phoneNumber\n        state\n        zipCode\n        __typename\n      }\n      heroImage\n      isLandingExperience\n      locationId\n      numJobOpenings\n      popularSearches {\n        text\n        url\n        __typename\n      }\n      __typename\n    }\n    copyrightYear\n    footerVO {\n      countryMenu {\n        childNavigationLinks {\n          id\n          link\n          textKey\n          __typename\n        }\n        id\n        link\n        textKey\n        __typename\n      }\n      __typename\n    }\n    helpCenterDomain\n    helpCenterLocale\n    isPotentialBot\n    jobAlert {\n      jobAlertExists\n      promptedOnJobsSearch\n      promptingForJobClicks\n      __typename\n    }\n    jobSearchQuery\n    loggedIn\n    searchCriteria {\n      implicitLocation {\n        id\n        localizedDisplayName\n        type\n        __typename\n      }\n      keyword\n      location {\n        id\n        localizedDisplayName\n        shortName\n        localizedShortName\n        type\n        __typename\n      }\n      __typename\n    }\n    showMachineReadableJobs\n    showMissingSearchFieldTooltip\n    __typename\n  }\n  companyFilterOptions {\n    id\n    shortName\n    __typename\n  }\n  pageImpressionGuid\n  pageSlotId\n  relatedCompaniesLRP\n  relatedCompaniesZRP\n  relatedJobTitles\n  resourceLink\n  seoTableEnabled\n  jobListingSeoLinks {\n    linkItems {\n      position\n      url\n      __typename\n    }\n    __typename\n  }\n  jobListings {\n    jobview {\n      job {\n        descriptionFragments\n        eolHashCode\n        jobReqId\n        jobSource\n        jobTitleId\n        jobTitleText\n        listingId\n        __typename\n      }\n      gdJobAttributes {\n        salarySource\n        basePay {\n          p25\n          p75\n          __typename\n        }\n        additionalPay {\n          p25\n          p75\n          __typename\n        }\n        __typename\n      }\n      jobListingAdminDetails {\n        adOrderId\n        cpcVal\n        importConfigId\n        jobListingId\n        jobSourceId\n        userEligibleForAdminJobDetails\n        __typename\n      }\n      overview {\n        id\n        name\n        shortName\n        squareLogoUrl\n        __typename\n      }\n      gaTrackerData {\n        trackingUrl\n        jobViewDisplayTimeMillis\n        requiresTracking\n        isIndeedJob\n        searchTypeCode\n        pageRequestGuid\n        isSponsoredFromJobListingHit\n        isSponsoredFromIndeed\n        __typename\n      }\n      header {\n        adOrderId\n        adOrderSponsorshipLevel\n        advertiserType\n        ageInDays\n        applyUrl\n        autoLoadApplyForm\n        easyApply\n        easyApplyMethod\n        employerNameFromSearch\n        jobLink\n        jobCountryId\n        jobResultTrackingKey\n        locId\n        locationName\n        locationType\n        needsCommission\n        normalizedJobTitle\n        organic\n        payPercentile90\n        payPercentile50\n        payPercentile10\n        hourlyWagePayPercentile {\n          payPercentile90\n          payPercentile50\n          payPercentile10\n          __typename\n        }\n        rating\n        salarySource\n        sponsored\n        payPeriod\n        payCurrency\n        savedJobId\n        sgocId\n        categoryMgocId\n        urgencySignal {\n          labelKey\n          messageKey\n          normalizedCount\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n"}' \
  --compressed

When I run this command directly in a terminal, even though the same request worked in the browser, I get a human-verification page, which suggests the request was flagged:

<h1>Help Us Protect Glassdoor</h1>
<p>
    Please help us protect Glassdoor by verifying that you're a
    real person. We are sorry for the inconvenience. If you continue to see this 
    message, please email
</p>

This kind of data should be available through the API, but the API is clearly having issues or not working, since I keep getting:

Glassdoor will not work properly unless browser cookie support is enabled

even though cookies are enabled. So what I’m trying to do is figure out how to make the curl command work outside the browser; then I will repeat whatever works to fetch the subsequent pages.

Note: I’m aware this can easily be achieved with Selenium, which I’m currently using to get the search results. I don’t like the Selenium approach because it’s slow, and I’m not sure whether this can be achieved without JavaScript. If I get the curl approach to work, it can easily be converted to a Python requests approach, which would be the optimal result.

Answers

- GillesQu233not
- May 12, 2023 at 2:32 pm
- 0 votes
The website uses JavaScript to render the page.

You can check this yourself whenever you are in doubt:

In Firefox, open about:config in a new tab, then double-click javascript.enabled so that it becomes false. Then refresh the web page:

the expected data is missing.
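The same check can be approximated without a browser: take the raw HTML (nothing executes JavaScript at that point) and see whether the text you expect is part of the visible markup. This is a minimal sketch using only the standard library; the `VisibleText` class, the `has_visible_text` helper and the `sample` page are illustrative names made up for this example, not anything Glassdoor provides:

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the text a browser would show with JavaScript disabled."""

    # Text inside these tags is never displayed as page content.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside <script>/<style>.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def has_visible_text(html, needle):
    """Return True if `needle` appears in the visible text of `html`."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return needle.lower() in " ".join(parser.chunks).lower()


# Hypothetical page whose data exists only inside a script tag.
sample = """<html><body><div id="app"></div>
<script>window.data = {"title": "Python Developer"};</script>
</body></html>"""

print(has_visible_text(sample, "Python Developer"))
print(has_visible_text("<p>Python Developer</p>", "python developer"))
```

If the text only shows up inside a script tag, as in `sample`, the page relies on JavaScript to display it, although the data may still be extractable from the raw HTML.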

To fetch the data, you need a library that is JavaScript-aware. It could be one of:
- Python, JavaScript or Java + Selenium
- Python + Playwright
- Python + Requests-HTML
- JavaScript (Node.js) + Puppeteer
With the latter, you can use this script as a replacement for cURL.
We need to fake the User-Agent to be able to retrieve the data:
```
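const puppeteer = require('puppeteer');

(async () => {
    // The URL to render is passed as the first command-line argument.
    const url = process.argv[2];
    if (!url) {
        console.error('Usage: ' + process.argv[1] + ' <URL>');
        process.exit(1);
    }
    const browser = await puppeteer.launch({headless: true});
    try {
        const page = await browser.newPage();
        // Present a regular desktop Chrome User-Agent instead of the headless default.
        await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
        // Wait until the network is mostly idle so client-side rendering can finish.
        await page.goto(url, { waitUntil: 'networkidle2' });
        // Dump the rendered DOM to stdout.
        const html = await page.evaluate(() => document.documentElement.outerHTML);
        console.log(html);
    } finally {
        // Always close the browser, even if navigation fails.
        await browser.close();
    }
})();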
```
Save the script as wget.js, then run:
```
node wget.js 'https://www.glassdoor.com/Job/python-jobs-SRCH_KO0,6.htm' > glassdoor.html
html2text glassdoor.html
```
As you may have guessed, this is a specialized area of IT called web scraping. I can be hired to do this kind of job and deliver the data as clean CSV, JSON, MongoDB, or anything™. This is my specialty.

Another solution is to use Python + requests, which is sometimes sufficient even though it cannot run JavaScript. But you have to dig into Chrome DevTools and figure out which request returns, for example, JSON or the HTML you need.
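When trying candidate requests outside the browser, it helps to tell quickly whether a response is usable JSON, ordinary HTML, or the human-verification page quoted in the question. This is a rough sketch under that assumption; `classify_body` is a hypothetical helper, and the marker string is simply the heading from the verification page shown in the question:

```python
import json

# Heading of the verification page quoted in the question.
VERIFICATION_MARKER = "Help Us Protect Glassdoor"


def classify_body(body):
    """Return 'json', 'verification' or 'html' for a response body string."""
    try:
        json.loads(body)
        return "json"
    except ValueError:
        # Not JSON, so fall through to the HTML checks.
        pass
    if VERIFICATION_MARKER.lower() in body.lower():
        return "verification"
    return "html"


print(classify_body('{"data": {}}'))
print(classify_body("<h1>Help Us Protect Glassdoor</h1>"))
print(classify_body("<html><body>jobs</body></html>"))
```

With requests, this would be called as `classify_body(res.text)` on each response you want to inspect.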

- GillesQu233not
- May 12, 2023 at 5:03 pm
- 0 votes
Using Python requests (no JavaScript is needed in this particular case):
```
#!/usr/bin/env python

import requests

# A session keeps cookies between requests.
session = requests.Session()

# Send a browser-like User-Agent with the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}

page = 'https://www.glassdoor.com/Job/python-jobs-SRCH_KO0,6.htm?Autocomplete='
res = session.get(page, headers=headers)
print(res.text)
```
Then, to go to the next page, you need to send the same JSON object that the website sends, in a POST request to https://www.glassdoor.com/graph.

You can see this request in Chrome Dev Tools: Network tab.

It’s a very large JSON payload.
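As a rough illustration of that POST, the sketch below rebuilds the request body from the fields visible in the captured request and sends it through an existing session. The helper names (`build_payload`, `cursor_for_page`, `post_graph`) are hypothetical, the response layout (`data` → `jobListings` → `paginationCursors`) is assumed from the query rather than confirmed, and obtaining a valid `gd-csrf-token` and cookies is not covered here, so the site may still answer with the verification page:

```python
import json

GRAPH_URL = "https://www.glassdoor.com/graph"


def build_payload(query, keyword, page_number, page_cursor, per_page=30):
    """Build the JobSearchQuery body; field names mirror the captured request."""
    return {
        "operationName": "JobSearchQuery",
        "variables": {
            "searchParams": {
                "keyword": keyword,
                "numPerPage": per_page,
                "searchType": "SR",
                "pageNumber": page_number,
                "pageCursor": page_cursor,
                "filterParams": [
                    {"filterKey": "includeNoSalaryJobs", "values": "true"},
                    {"filterKey": "sc.keyword", "values": keyword},
                    {"filterKey": "locT", "values": ""},
                    {"filterKey": "locId", "values": ""},
                ],
                "seoUrl": False,
            }
        },
        # The full GraphQL query string copied from DevTools goes here.
        "query": query,
    }


def cursor_for_page(response_json, page_number):
    """Find the cursor for a page in an earlier response (assumed layout)."""
    cursors = response_json["data"]["jobListings"]["paginationCursors"]
    for entry in cursors:
        if entry["pageNumber"] == page_number:
            return entry["cursor"]
    return None


def post_graph(session, payload, csrf_token):
    """POST the payload using an existing session, e.g. requests.Session()."""
    headers = {
        "content-type": "application/json",
        "gd-csrf-token": csrf_token,
        "apollographql-client-name": "job-search",
        "referer": "https://www.glassdoor.com/",
    }
    return session.post(GRAPH_URL, headers=headers, data=json.dumps(payload))
```

Reusing the session that loaded the first page means its cookies are sent along with the POST, and `cursor_for_page` can supply the `pageCursor` for each following page from the previous response.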