
I am trying to web scrape online job listing websites for my Coursera project.
I keep getting a 403 error, which, after searching for its meaning online, I found means that the website has anti-web-scraping protection.
Does anyone know a countermeasure for this?

PS: I have tried web scraping the Indeed and We Work Remotely websites, and I get the same error after executing my code.
Here’s my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://weworkremotely.com/remote-jobs'

# We Work Remotely blocks traffic from non-browser clients, so we send a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

# Send a request to the website and get the HTML content
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    # Creating empty lists to store the data
    job_titles = []
    companies = []
    locations = []
    job_links = []

    job_sections = soup.find_all('section', class_='jobs')

    for section in job_sections:
        jobs = section.find_all('li', class_='feature')  # Ensure this class matches the site's HTML

        for job in jobs:
            # Job title
            title_tag = job.find('span', class_='title')
            title = title_tag.text.strip() if title_tag else 'N/A'
            job_titles.append(title)

            # Company name
            company_tag = job.find('span', class_='company')
            company = company_tag.text.strip() if company_tag else 'N/A'
            companies.append(company)

            # Location
            location_tag = job.find('span', class_='region company')
            location = location_tag.text.strip() if location_tag else 'Remote'
            locations.append(location)

            # Job link
            job_link_tag = job.find('a', href=True)
            job_link = 'https://weworkremotely.com' + job_link_tag['href'] if job_link_tag else 'N/A'
            job_links.append(job_link)

    # Create a DataFrame using the extracted data
    job_data = pd.DataFrame({
        'Job Title': job_titles,
        'Company': companies,
        'Location': locations,
        'Job Link': job_links
    })

    # Save the data to a CSV file
    job_data.to_csv('we_work_remotely_jobs.csv', index=False)
    print("Job listings have been successfully saved to we_work_remotely_jobs.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

2 Answers


  1. They give you the data in RSS/XML – https://weworkremotely.com/remote-job-rss-feed

    You don’t need to scrape the HTML; just hit the RSS feeds and digest the data from there. Make sure you control how often you hit this server: it seems that requests in rapid succession will get you blocked, so polling these feeds 3-5 times daily should be fine. A sketch of this approach follows below.

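    A minimal sketch of that approach, using only requests plus the standard-library XML parser (the feed URL is the one from this answer; the tag names assume the standard RSS 2.0 layout, so verify them against the actual feed):

    import xml.etree.ElementTree as ET

    import requests

    FEED_URL = 'https://weworkremotely.com/remote-job-rss-feed'
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
    }

    response = requests.get(FEED_URL, headers=headers)
    response.raise_for_status()

    # Standard RSS 2.0 layout: <rss><channel><item><title>/<link>...
    root = ET.fromstring(response.content)
    for item in root.iter('item'):
        title = item.findtext('title', default='N/A')
        link = item.findtext('link', default='N/A')
        print(title, '->', link)
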
  2. To address a 403 Forbidden error when web scraping (rough sketches of these points follow after the list):

    1. Use a Valid User-Agent: Set a common User-Agent header to mimic a browser.

    2. Use Proxies: Rotate IP addresses using proxies to avoid IP blocking.

    3. Respect robots.txt: Check and follow the website’s scraping rules.

    4. Add Delays: Introduce delays between requests to mimic human behavior.

    5. Handle JavaScript: Use tools like Selenium for websites with JavaScript-rendered content.

    Source: ScrapingBee – How to Handle a 403 Forbidden Error in Web Scraping
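    A rough sketch combining points 1-4, reusing the URL from the question (the proxy entry is a placeholder to replace with a real proxy address):

    import time
    import urllib.robotparser

    import requests

    url = 'https://weworkremotely.com/remote-jobs'

    # Point 3: check robots.txt before fetching anything
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url('https://weworkremotely.com/robots.txt')
    rp.read()
    if not rp.can_fetch('*', url):
        raise SystemExit('robots.txt disallows fetching this URL')

    # Point 1: browser-like headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    # Point 2: a pool of proxies to rotate through; None means a direct
    # connection, and the commented entry shows the expected dict shape
    proxy_pool = [
        None,
        # {'http': 'http://203.0.113.10:8080', 'https': 'http://203.0.113.10:8080'},
    ]

    for i, page_url in enumerate([url]):  # extend this list with more pages
        proxies = proxy_pool[i % len(proxy_pool)]
        response = requests.get(page_url, headers=headers, proxies=proxies, timeout=30)
        print(page_url, response.status_code)
        time.sleep(5)  # Point 4: pause between requests to mimic a human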

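    And a sketch for point 5, assuming Selenium 4 (which downloads a matching driver automatically) with headless Firefox; the URL is again the one from the question:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')  # no visible browser window

    driver = webdriver.Firefox(options=options)
    try:
        driver.get('https://weworkremotely.com/remote-jobs')
        # A real browser runs the site's JavaScript, so JS-rendered content
        # is present in the DOM by the time we query it
        for section in driver.find_elements(By.CSS_SELECTOR, 'section.jobs'):
            print(section.text[:200])
    finally:
        driver.quit()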