I am new to Python, which I am learning for web scraping. I am using BeautifulSoup to collect descriptions from job offers at: https://justjoin.it/offers/itds-net-fullstack-developer-angular
On another job-offer site, the same code with different div classes finds what I need. I wrote this piece of code for justjoin.it:
import requests
from bs4 import BeautifulSoup

link = "https://justjoin.it/offers/jungle-devops-engineer"
response_IDs = requests.get(link)
soup = BeautifulSoup(response_IDs.text, 'html.parser')
Search_part = soup.find(id='root')
description = Search_part.find_all('div', class_='css-gz8dae')
for i in description:
    print(i)
Please help me write working code.
2 Answers
As mentioned in the comments, the issue is that the content on this site is rendered using JavaScript, so requests will not be able to scrape the dynamic content. Selenium can fix this issue as it uses a web driver to render/execute the JavaScript.
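One quick way to confirm this diagnosis is to check whether the class you are searching for appears anywhere in the raw HTML that requests receives. The sketch below runs that check on a made-up stand-in page; the helper name `appears_in_static_html` and the sample HTML are illustrative assumptions, not the site's actual markup.

```python
def appears_in_static_html(html, marker):
    """Return True if `marker` occurs in the raw HTML string the server sent."""
    return marker in html


# Made-up stand-in for a JavaScript-rendered page: the server sends an empty
# mount point and a script tag, and the browser fills in the rest later.
static_html = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

print(appears_in_static_html(static_html, "css-gz8dae"))  # False
print(appears_in_static_html(static_html, 'id="root"'))   # True
```

Against the live page, the same check would be `appears_in_static_html(requests.get(link).text, "css-gz8dae")`. A `False` result there suggests the element is created by JavaScript after the page loads.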
First, make sure you have installed Selenium (pip install selenium). On Google Colab, add a ! in front of pip install (see below).
As I mentioned, I run all my Python on Google Colab, which uses Firefox. This works for me:
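A minimal sketch of this approach, assuming Selenium 4 with Firefox and geckodriver available on the PATH, might look like the following. The function names, the headless flag, and the 15-second timeout are illustrative assumptions, and `css-gz8dae` is simply the class from the question, which the site may have changed since.

```python
def div_selector(class_name):
    """Build a CSS selector matching <div> elements with the given class."""
    return f"div.{class_name}"


def fetch_description_texts(url, class_name="css-gz8dae", timeout=15):
    """Render `url` in headless Firefox and return the text of matching divs."""
    # Selenium is imported here so the module still loads where it is missing.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no display is available on Colab
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        selector = div_selector(class_name)
        # Wait until JavaScript has inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, selector)]
    finally:
        driver.quit()  # always close the browser, even if the wait times out


# Usage (needs Firefox and geckodriver installed):
# for text in fetch_description_texts(
#     "https://justjoin.it/offers/jungle-devops-engineer"
# ):
#     print(text)
```

The explicit wait matters here: reading the page immediately after `driver.get` can still miss content that the JavaScript has not inserted yet.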
This is the output:
If you use Chrome, replace this line
with this:
To run the whole thing on Google Colab, you first need to install Selenium and Firefox like this:
Then you will also need geckodriver, which must be on the system’s PATH:
After these installations, run the code above.
As Pawel Kam and cconsta1 have explained, in order for the website to fully render, a bunch of JS needs to be executed. If you want the entirety of the website’s HTML, then just use Selenium (as cconsta1 has detailed in their answer). But if you only want the info in the Description section of the job posting, then the following solution is arguably more appropriate.
Getting the JSON file that contains the job Description info.
Using my browser’s Dev Tools, I found that the website makes a GET request to this API to get all of the information you see on the job posting. Specifically, the response to the request is JSON.
Thus, if you only want data shown in the job posting, all you have to do is request the JSON file and then use BeautifulSoup to parse it for the specific data you want.
I found this article helpful when I was first learning about web scraping by "reverse engineering" a website’s requests.
The following script can be used to get the JSON file and parse the HTML of the Description section:
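A minimal sketch of that idea is shown here under stated assumptions: `api_url` stands for whatever request URL Dev Tools shows for the offer, and the key `"body"` is an assumed name for the field holding the Description HTML, so inspect the real JSON keys before relying on it. The function names and the sample payload are hypothetical.

```python
def fetch_offer_json(api_url):
    """Download the offer's JSON document from the URL seen in Dev Tools."""
    import requests  # third-party; imported lazily so the helpers below work without it

    response = requests.get(api_url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def description_html(offer, field="body"):
    """Return the HTML fragment stored under `field` (an assumed key name)."""
    value = offer.get(field)
    if not isinstance(value, str):
        raise KeyError(f"no HTML string under {field!r}; inspect offer.keys()")
    return value


def html_to_text(fragment):
    """Strip the tags from an HTML fragment with BeautifulSoup."""
    from bs4 import BeautifulSoup  # third-party, same parser as in the question

    return BeautifulSoup(fragment, "html.parser").get_text(separator="\n", strip=True)


# Hypothetical payload standing in for the real API response.
sample_offer = {
    "title": "DevOps Engineer",
    "body": "<p>Build pipelines.</p><ul><li>Docker</li></ul>",
}
print(description_html(sample_offer))
```

With a real URL, the three helpers chain as `html_to_text(description_html(fetch_offer_json(api_url)))`. Keeping the download separate from the parsing makes it easy to save the JSON once and experiment with the parsing offline.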
The other print outputs are kind of large, so I’ll only show the output of print(job_summary):
You’ll have to play around with it to get the exact info you want. Let me know if you need me to clarify anything.