I’m trying to perform a job search on Glassdoor and crawl the results. Here’s the first page of an example search. The second page of results is obtained with a fetch request that is triggered by clicking the next button.

Here’s the curl sent by the browser after clicking the next button:

curl 'https://www.glassdoor.com/graph' \
  -H 'authority: www.glassdoor.com' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9,ar;q=0.8' \
  -H 'apollographql-client-name: job-search' \
  -H 'apollographql-client-version: 0.27.22' \
  -H 'content-type: application/json' \
  -H 'cookie: ...' \
  -H 'gd-csrf-token: ...' \
  -H 'origin: https://www.glassdoor.com' \
  -H 'referer: https://www.glassdoor.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="113", "Chromium";v="113", "Not-A.Brand";v="24"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36' \
  --data-raw '{"operationName":"JobSearchQuery","variables":{"searchParams":{"keyword":"python","numPerPage":30,"searchType":"SR","pageNumber":2,"pageCursor":"AB4AAYEAHgAAAAAAAAAAAAAAAgMLZPsAPwEBAQ0EQ8ADTE5cfDqxexbrGvagYDN+iOmE8v7CeRN5VA9AB6c4BP2LZ3F1cbEmTsNtyle+DW0dPbIa5HyzwQAA","filterParams":[{"filterKey":"includeNoSalaryJobs","values":"true"},{"filterKey":"sc.keyword","values":"python"},{"filterKey":"locT","values":""},{"filterKey":"locId","values":""}],"seoUrl":false}},"query":"query JobSearchQuery($searchParams: SearchParams) {n  jobListings(contextHolder: {searchParams: $searchParams}) {n    ...SearchFragmentn    __typenamen  }n}nnfragment SearchFragment on JobListingSearchResults {n  adOrderJobLinkImpressionTrackingn  totalJobsCountn  filterOptionsn  companiesLinkn  searchQueryGuidn  indeedCtkn  jobSearchTrackingKeyn  paginationCursors {n    pageNumbern    cursorn    __typenamen  }n  searchResultsMetadata {n    cityPages {n      cityBlurbn      cityPagesStats {n        bestCitiesForJobsRankn        meanBaseSalaryn        populationn        unemploymentRaten        __typenamen      }n      displayNamen      employmentResources {n        addressLine1n        addressLine2n        cityNamen        namen        phoneNumbern        staten        zipCoden        __typenamen      }n      heroImagen      isLandingExperiencen      locationIdn      numJobOpeningsn      popularSearches {n        textn        urln        __typenamen      }n      __typenamen    }n    copyrightYearn    footerVO {n      countryMenu {n        childNavigationLinks {n          idn          linkn          textKeyn          __typenamen        }n        idn        linkn        textKeyn        __typenamen      }n      __typenamen    }n    helpCenterDomainn    helpCenterLocalen    isPotentialBotn    jobAlert {n      jobAlertExistsn      promptedOnJobsSearchn      promptingForJobClicksn      __typenamen    }n    jobSearchQueryn    loggedInn    searchCriteria {n      implicitLocation {n        idn   
     localizedDisplayNamen        typen        __typenamen      }n      keywordn      location {n        idn        localizedDisplayNamen        shortNamen        localizedShortNamen        typen        __typenamen      }n      __typenamen    }n    showMachineReadableJobsn    showMissingSearchFieldTooltipn    __typenamen  }n  companyFilterOptions {n    idn    shortNamen    __typenamen  }n  pageImpressionGuidn  pageSlotIdn  relatedCompaniesLRPn  relatedCompaniesZRPn  relatedJobTitlesn  resourceLinkn  seoTableEnabledn  jobListingSeoLinks {n    linkItems {n      positionn      urln      __typenamen    }n    __typenamen  }n  jobListings {n    jobview {n      job {n        descriptionFragmentsn        eolHashCoden        jobReqIdn        jobSourcen        jobTitleIdn        jobTitleTextn        listingIdn        __typenamen      }n      gdJobAttributes {n        salarySourcen        basePay {n          p25n          p75n          __typenamen        }n        additionalPay {n          p25n          p75n          __typenamen        }n        __typenamen      }n      jobListingAdminDetails {n        adOrderIdn        cpcValn        importConfigIdn        jobListingIdn        jobSourceIdn        userEligibleForAdminJobDetailsn        __typenamen      }n      overview {n        idn        namen        shortNamen        squareLogoUrln        __typenamen      }n      gaTrackerData {n        trackingUrln        jobViewDisplayTimeMillisn        requiresTrackingn        isIndeedJobn        searchTypeCoden        pageRequestGuidn        isSponsoredFromJobListingHitn        isSponsoredFromIndeedn        __typenamen      }n      header {n        adOrderIdn        adOrderSponsorshipLeveln        advertiserTypen        ageInDaysn        applyUrln        autoLoadApplyFormn        easyApplyn        easyApplyMethodn        employerNameFromSearchn        jobLinkn        jobCountryIdn        jobResultTrackingKeyn        locIdn        locationNamen        locationTypen        
needsCommissionn        normalizedJobTitlen        organicn        payPercentile90n        payPercentile50n        payPercentile10n        hourlyWagePayPercentile {n          payPercentile90n          payPercentile50n          payPercentile10n          __typenamen        }n        ratingn        salarySourcen        sponsoredn        payPeriodn        payCurrencyn        savedJobIdn        sgocIdn        categoryMgocIdn        urgencySignal {n          labelKeyn          messageKeyn          normalizedCountn          __typenamen        }n        __typenamen      }n      __typenamen    }n    __typenamen  }n  __typenamen}n"}' 
  --compressed

When I run this directly in a terminal, I get a human verification message instead of the results, even though the same request worked in the browser:

<h1>Help Us Protect Glassdoor</h1>
<p>
    Please help us protect Glassdoor by verifying that you're a
    real person. We are sorry for the inconvenience. If you continue to see this 
    message, please email
</p>
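A crawler can at least detect this page instead of trying to parse it as results. A minimal sketch, assuming the heading text shown above is a stable marker of the verification page (the function name `looks_like_verification_page` is hypothetical):

```python
# Heading text taken from the verification page quoted above.
VERIFICATION_MARKER = "Help Us Protect Glassdoor"


def looks_like_verification_page(body: str) -> bool:
    """Return True if the response body contains the verification heading."""
    return VERIFICATION_MARKER in body


if __name__ == "__main__":
    blocked_body = "<h1>Help Us Protect Glassdoor</h1>"
    print(looks_like_verification_page(blocked_body))
    print(looks_like_verification_page('{"data": {}}'))
```

Checking the body before calling a JSON parser makes the failure explicit, so the crawler can stop or back off rather than crash on unexpected HTML.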

This kind of data should be available through the API, which is clearly having issues or not working, since I keep getting a

Glassdoor will not work properly unless browser cookie support is enabled

despite the cookies being enabled. So, what I’m trying to do is figure out how to make the curl work outside the browser, then I will repeat whatever works for getting the subsequent results.
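The query in the curl above asks for a `paginationCursors` list with `pageNumber` and `cursor` fields, which suggests that each response carries the cursors needed for later pages. A minimal sketch of picking the cursor for a given page, assuming the response has the usual GraphQL shape with a top-level `data` key and that `jobListings` mirrors the query (the helper name `cursor_for_page` and the sample cursor strings are hypothetical):

```python
from typing import Optional


def cursor_for_page(response_json: dict, page_number: int) -> Optional[str]:
    """Return the cursor for page_number, or None if the response has none."""
    listings = (response_json.get("data") or {}).get("jobListings") or {}
    for entry in listings.get("paginationCursors") or []:
        if entry.get("pageNumber") == page_number:
            return entry.get("cursor")
    return None


if __name__ == "__main__":
    # Made-up response fragment; real cursors are long opaque strings.
    sample = {
        "data": {
            "jobListings": {
                "paginationCursors": [
                    {"pageNumber": 2, "cursor": "CURSOR_FOR_PAGE_2"},
                    {"pageNumber": 3, "cursor": "CURSOR_FOR_PAGE_3"},
                ]
            }
        }
    }
    print(cursor_for_page(sample, 3))
```

Returning `None` when no cursor matches gives the crawl loop a natural stopping condition.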

Note: I’m aware this can easily be achieved with Selenium, which I’m currently using to get the search results. I don’t like the Selenium approach because it’s slow, and I’m not sure whether the same thing can be achieved without JavaScript. If I get the curl approach to work, it can easily be converted to a Python requests approach, which would be the optimal result.

2 Answers


  1. The website uses JavaScript to render the web page.

    You can test this yourself whenever you are in doubt:

    In Firefox, open an about:config tab, then double-click javascript.enabled so that it becomes false. Then refresh the web page:

    Warning
    You won’t get the expected data.

    To fetch the data, you need a JavaScript-aware library, such as Puppeteer.

    With Puppeteer, you can use this script as a replacement for cURL.
    We need to fake the User-Agent to be able to retrieve the data:

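    const puppeteer = require('puppeteer');
    
    (async () => {
        // The URL to render is the first command-line argument.
        const url = process.argv[2];
        if (!url) {
            console.error('Usage: ' + process.argv[1] + ' <URL>');
            process.exit(1);
        }
        const browser = await puppeteer.launch({headless: true});
        try {
            const page = await browser.newPage();
            // Fake a desktop Chrome User-Agent so the site serves the data.
            await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
            // Wait until network activity settles so the JavaScript-rendered content is present.
            await page.goto(url, { waitUntil: 'networkidle2' });
            const html = await page.evaluate(() => document.documentElement.outerHTML);
            console.log(html);
        } finally {
            // Close the browser even if navigation fails, so the process can exit.
            await browser.close();
        }
    })();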
    

    Then use:

    node wget.js 'https://www.glassdoor.com/Job/python-jobs-SRCH_KO0,6.htm' > glassdoor.html
    html2text glassdoor.html
    

    As you may have gathered, this is a specialized IT field called web scraping. I can be hired to do this kind of job and deliver the data as clean CSV, JSON, MongoDB, or anything™. This is my specialty.


    Another solution is to use Python + requests, which is sometimes sufficient even though it cannot run JavaScript. But you have to dig into Chrome Dev Tools and figure out which request returns, for example, the JSON or the HTML you need.

  2. Using Python requests (no JavaScript needed in this special case):

    #!/usr/bin/env python
    
    import requests
    
    # Reuse one session so cookies set by the first response are kept.
    session = requests.Session()
    
    # Send a browser-like User-Agent with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
    }
    
    # First page of the example search.
    page = 'https://www.glassdoor.com/Job/python-jobs-SRCH_KO0,6.htm?Autocomplete='
    res = session.get(page, headers=headers)
    print(res.text)
    

    Then, to go to the next page, you need to send the same JSON object that the website sends, in a POST request to https://www.glassdoor.com/graph.

    You can see this request in Chrome Dev Tools: Network tab.

    It’s a very big JSON.
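The POST described above can be sketched as follows, with the variable part of the body built from the fields visible in the question's curl. This is a sketch under assumptions, not a verified client: `build_search_payload` and `fetch_page` are hypothetical helper names, `session` is assumed to be a `requests.Session` that already holds the site's cookies, the `gd-csrf-token` value and the full query text must be copied from the Network tab, and nothing here guarantees that the site will accept the request rather than show its verification page.

```python
GRAPH_URL = "https://www.glassdoor.com/graph"


def build_search_payload(query: str, keyword: str, page_number: int, page_cursor: str) -> dict:
    """Build the JSON body, mirroring the structure seen in the browser request."""
    return {
        "operationName": "JobSearchQuery",
        "variables": {
            "searchParams": {
                "keyword": keyword,
                "numPerPage": 30,
                "searchType": "SR",
                "pageNumber": page_number,
                "pageCursor": page_cursor,
                "filterParams": [
                    {"filterKey": "includeNoSalaryJobs", "values": "true"},
                    {"filterKey": "sc.keyword", "values": keyword},
                    {"filterKey": "locT", "values": ""},
                    {"filterKey": "locId", "values": ""},
                ],
                "seoUrl": False,
            }
        },
        # The long GraphQL query string, copied unchanged from Dev Tools.
        "query": query,
    }


def fetch_page(session, payload: dict, csrf_token: str) -> dict:
    """POST the payload using an existing requests.Session and return the parsed JSON."""
    headers = {
        "content-type": "application/json",
        "gd-csrf-token": csrf_token,
        "apollographql-client-name": "job-search",
        "apollographql-client-version": "0.27.22",
        "origin": "https://www.glassdoor.com",
        "referer": "https://www.glassdoor.com/",
    }
    response = session.post(GRAPH_URL, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()
```

A call would look like `fetch_page(session, build_search_payload(query, "python", 2, cursor), token)`, where `query`, `cursor`, and `token` are the values copied from the browser request. Keeping the payload builder separate from the network call means only `pageNumber` and `pageCursor` change between pages.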
