
I have a rather strange problem that I can’t explain.

Context: as part of a project, I created a web scraping script (Selenium + BeautifulSoup) to retrieve job offers in France on LinkedIn for data professions (Data Analyst, Data Engineer, Data Scientist, etc.).

I built this script on my own computer, using options to run the browser without displaying the page (headless) and to open the site in incognito mode (so I don’t have to log in; otherwise the search results are biased by their algorithm).

Importantly, it only retrieves job offers that are less than 24 hours old. When the script is running on my computer, I retrieve around 300 job offers every day, all professions included.

As part of the project, I’m running this script on a virtual machine at a cloud provider. There, the number of job offers retrieved varies between 75 and 110, but is never more than that.

I’ve also tried running the script in Docker (on my computer) as a test: it never scrapes more than 30 job offers.

Do you know what might be influencing this? No configuration is changed from one platform to another. It’s the same script with the same options, yet the number of search results I get back seems to be affected by something I can’t identify. Any ideas?

For information, I use the following Selenium options:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')  # no login, so results are not personalized
options.add_argument('--headless')   # do not display the page
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--lang=fr-FR')
options.add_argument('--disable-features=MediaSessionService')
options.add_argument('--disable-features=VizDisplayCompositor')

And the URL for the search results is built as follows; job_search.replace(' ', '%20') inserts the job title, with spaces URL-encoded, as the search keywords:

url = f"https://www.linkedin.com/jobs/search?keywords={job_search.replace(' ', '%20')}&location=France&locationId=&geoId=105015875&sortBy=R&f_TPR=r86400&position=1&pageNum=0"
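As a possible alternative to replacing spaces by hand, the same URL could be assembled with the standard library so that every parameter is encoded consistently. This is only a sketch: the helper name build_search_url and the constant BASE_URL are hypothetical, and the parameter values are copied from the URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.linkedin.com/jobs/search"


def build_search_url(job_search: str) -> str:
    """Build the search URL for one job title (hypothetical helper)."""
    params = {
        "keywords": job_search,
        "location": "France",
        "locationId": "",
        "geoId": "105015875",
        "sortBy": "R",
        # Copied from the original URL; 86400 seconds is 24 hours, which
        # presumably corresponds to the "less than 24 hours old" filter.
        "f_TPR": "r86400",
        "position": "1",
        "pageNum": "0",
    }
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would use '+'.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(build_search_url("Data Engineer"))
```

Unlike a plain replace(' ', '%20'), this also escapes other special characters (for example '&' or accented letters) that might appear in a job title.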

I’m not giving the full script because it runs without errors in every environment. My impression is that the problem may be linked to the network or to something else outside the script.
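One way to test the network hypothesis, offered purely as an assumption and not as a confirmed cause, would be to wait until the number of job cards on the page stops growing before parsing it, and to log that number in each environment. The sketch below uses only the standard library; wait_until_stable and its parameters are hypothetical names, and count_items stands for any function that returns the current number of loaded results.

```python
import time
from typing import Callable


def wait_until_stable(count_items: Callable[[], int],
                      poll_interval: float = 1.0,
                      stable_polls: int = 3,
                      timeout: float = 60.0) -> int:
    """Poll count_items() until it returns the same value `stable_polls`
    times in a row, or until `timeout` seconds have elapsed.

    Returns the last observed count.
    """
    deadline = time.monotonic() + timeout
    last = count_items()
    unchanged = 1  # number of consecutive polls that returned `last`
    while unchanged < stable_polls and time.monotonic() < deadline:
        time.sleep(poll_interval)
        current = count_items()
        if current == last:
            unchanged += 1
        else:
            # The page is still changing: restart the stability counter.
            last, unchanged = current, 1
    return last


if __name__ == "__main__":
    # Simulated page whose result count grows, then settles.
    fake_counts = iter([10, 25, 25, 30, 30, 30, 30])
    print(wait_until_stable(lambda: next(fake_counts), poll_interval=0.0))
```

With Selenium, count_items could be something like lambda: len(driver.find_elements(By.CSS_SELECTOR, selector)), where selector is a placeholder for whatever matches one job card in the actual script. If the stable count still differs between the local machine, the virtual machine, and Docker, slow loading is probably not the whole explanation.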

2 Answers


  1. Chosen as BEST ANSWER

    I may have found the solution to my issue.

    I was using an AWS EC2 t2.micro instance, which has "Low to Moderate" network performance.

    I upgraded the instance to a t3.micro, which has "Up to 5 Gigabit" network performance.

    I don't know why, but it now works correctly, just as when I run the script on my own computer.


  2. I am a beginner at web crawling. My supervisor has asked me to analyze the flow information for a certain position on LinkedIn, but I have not been able to crawl that information. Could you share some of your current ideas?
