
I am new to Python, which I am learning for scraping purposes. I am using BeautifulSoup to collect descriptions from job offers at: https://justjoin.it/offers/itds-net-fullstack-developer-angular

On another site with job offers, the same code with different div classes finds what I need. I wrote this piece of code for justjoin.it:

import requests
from bs4 import BeautifulSoup

link="https://justjoin.it/offers/jungle-devops-engineer"

response_IDs=requests.get(link)
soup=BeautifulSoup(response_IDs.text, 'html.parser')
Search_part = soup.find(id='root')
description= Search_part.find_all('div', class_='css-gz8dae')

for i in description:
    print(i)

Please help me write functional code.

2 Answers


  1. As mentioned in the comments, the issue is that the content on this site is rendered using JavaScript, so requests will not be able to scrape the dynamic content. Selenium fixes this because it uses a web driver to render and execute the JavaScript.
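A quick way to see this for yourself, without any network access (a minimal sketch; the HTML strings below are simplified stand-ins for the real page, not its actual markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-ins for the real page: the server's initial response
# contains only an empty root placeholder; the description div (class
# 'css-gz8dae') only exists after JavaScript has run in a browser.
server_html = '<html><body><div id="root"></div></body></html>'
rendered_html = ('<html><body><div id="root">'
                 '<div class="css-gz8dae">Job description text</div>'
                 '</div></body></html>')

def count_descriptions(html):
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all('div', class_='css-gz8dae'))

print(count_descriptions(server_html))    # 0 -> all that requests ever sees
print(count_descriptions(rendered_html))  # 1 -> what a browser (Selenium) sees
```

That is why find_all returned an empty list in the question's code: the class it searches for simply is not in the HTML that requests downloads.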

    First, make sure you have installed Selenium:

    pip install selenium
    

    For Google Colab, please add a ! in front of pip install (see below).

    As I mentioned, I run all my Python on Google Colab, which uses Firefox. This works for me:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    
    link = "https://justjoin.it/offers/jungle-devops-engineer"
    
    # Set up headless browser (no GUI)
    options = Options()
    options.add_argument("--headless")  # the options.headless attribute was removed in Selenium 4.10+
    browser = webdriver.Firefox(options=options)
    
    # Use Selenium to get the page source after JavaScript has executed
    browser.get(link)
    page_source = browser.page_source
    browser.quit()
    
    # Use BeautifulSoup to parse the HTML
    soup = BeautifulSoup(page_source, 'html.parser')
    description = soup.find_all('div', class_='css-gz8dae')
    
    for i in description:
        print(i.text)
    

    This is the output:

    Running a flexible Machine Learning engine at scale is hard. 
    We must ingest and process large volumes of data 
    uninterruptedly and store it in a scalable manner. 
    The data needs to be prepared and served to hundreds of 
    models constantly. All the predictions of the models, as well as other data pipelines, ...
    

    If you use Chrome, change this line

    browser = webdriver.Firefox(options=options)
    

    with this:

    browser = webdriver.Chrome(options=options)
    

    To run the whole thing on Google Colab, you first need to install Selenium and Firefox like this:

    !pip install selenium
    !apt-get update
    !apt install -y firefox
    !apt install -y wget
    !apt install -y unzip
    

    Then you will also need GeckoDriver, which must be on the system's PATH:

    !wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux64.tar.gz
    !tar -xvf geckodriver-v0.30.0-linux64.tar.gz
    !chmod +x geckodriver
    !mv geckodriver /usr/local/bin/
    

    After these installations, run the code above.

  2. As Pawel Kam and cconsta1 have explained, a bunch of JS needs to be executed in order for the website to fully render. If you want the entirety of the website’s HTML, then just use Selenium (as cconsta1 has detailed in their answer). But if you only want the info in the Description section of the job posting, then the following solution is arguably more appropriate.

    Getting the JSON file that contains the job Description info.

    Using my browser’s Dev Tools, I found that the website makes a GET request to this API endpoint to get all of the information you see on the job posting. Specifically, the response to that request is JSON.

    Thus, if you only want the data shown in the job posting, all you have to do is request the JSON and then use BeautifulSoup to parse the HTML it contains for the specific data you want.

    I found this article helpful when I was first learning about web scraping by "reverse engineering" a website’s requests.

    The following script can be used to get the JSON file and parse the HTML of the Description section:

    import requests
    import json
    from bs4 import BeautifulSoup
    
    def pretty_print_json(json_obj):
        json_string = json.dumps(json_obj, indent=4)
        print(json_string)
    
    def get_json(url, req_headers):
        response = requests.get(url, headers=req_headers)
    
        # makes JSON file into dict object
        return response.json()
        
    def find_first_element(html, tag):
        soup = BeautifulSoup(html, 'html.parser')

        # find the first occurrence of the given element
        element = soup.find(tag)
        return element
    
    def pretty_print_html(html):
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.prettify())
    
    if __name__ == "__main__":
    
        url = "https://justjoin.it/api/offers/itds-net-fullstack-developer-angular"
        api_headers = {
            "X-CSRF-Token": "/w2ocZnRs5LN43gzQsi8zWYcdAOVmhjBEpB/dduBn5rnhzjqOnvlo7SsrEdf5Rht3Aa2x/+/00OZJuh3tgmaDA=="
        }
        json_obj = get_json(url, api_headers)
    
        # view the entire JSON file (in a readable format)
        # to familiarize yourself with its structure
        pretty_print_json(json_obj)
    
        # access HTML that makes up Description section of job posting
        job_description_html = json_obj['body']
    
        # look at job description html
        pretty_print_html(job_description_html)
    
        # get the job summary (i.e. the opening paragraph of Description section) 
        job_summary = find_first_element(job_description_html, 'div').text
        print(job_summary)
    

    The other print outputs are kind of large, so I’ll only show the output of print(job_summary):

    As a .NET FullStack Developer (Angular) you will be working on implementing innovative 
    architectural solutions for our client in the banking sector. Our client is the first 
    fully online bank in Poland, setting directions for the development of mobile and online 
    banking. It is one of the strongest and fastest growing financial brands in Poland. Your 
    key responsibilities: 
    

    You’ll have to play around with it to get the exact info you want. Let me know if you need me to clarify anything.
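For example, once you have the dict, other fields can be pulled out the same way. A hypothetical sketch: the field names below ('title', 'skills') are assumptions for illustration only; check the pretty-printed JSON for the real keys:

```python
# Hypothetical sketch: 'title' and 'skills' are assumed field names for
# illustration; inspect pretty_print_json's output for the real structure.
sample_offer = {
    "title": ".NET FullStack Developer (Angular)",
    "skills": [{"name": "Angular", "level": 4}, {"name": ".NET", "level": 4}],
    "body": "<div>...</div>",
}

def extract_skill_names(offer):
    # .get() tolerates the key being absent in some offers
    return [skill["name"] for skill in offer.get("skills", [])]

print(sample_offer["title"])              # .NET FullStack Developer (Angular)
print(extract_skill_names(sample_offer))  # ['Angular', '.NET']
```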
