I am trying to scrape online job-listing websites for my Coursera project.
I keep getting a 403 error. After searching for its meaning online, I found out that it means the website has anti-scraping protection.
Does anyone know a countermeasure for this?
PS: I have tried scraping the Indeed and We Work Remotely websites, and both return the same error when I run my code.
Here’s my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://weworkremotely.com/remote-jobs'

# We Work Remotely blocks traffic from non-browsers, so we add a browser User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

# Send a request to the website and get the HTML content
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    # Creating empty lists to store the data
    job_titles = []
    companies = []
    locations = []
    job_links = []

    job_sections = soup.find_all('section', class_='jobs')
    for section in job_sections:
        jobs = section.find_all('li', class_='feature')  # Ensure this class matches the site's HTML
        for job in jobs:
            # Job title
            title_tag = job.find('span', class_='title')
            title = title_tag.text.strip() if title_tag else 'N/A'
            job_titles.append(title)

            # Company name
            company_tag = job.find('span', class_='company')
            company = company_tag.text.strip() if company_tag else 'N/A'
            companies.append(company)

            # Location
            location_tag = job.find('span', class_='region company')
            location = location_tag.text.strip() if location_tag else 'Remote'
            locations.append(location)

            # Job link
            job_link_tag = job.find('a', href=True)
            job_link = 'https://weworkremotely.com' + job_link_tag['href'] if job_link_tag else 'N/A'
            job_links.append(job_link)

    # Create a DataFrame using the extracted data
    job_data = pd.DataFrame({
        'Job Title': job_titles,
        'Company': companies,
        'Location': locations,
        'Job Link': job_links
    })

    # Save the data to a CSV file
    job_data.to_csv('we_work_remotely_jobs.csv', index=False)
    print("Job listings have been successfully saved to we_work_remotely_jobs.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
2 Answers
They give you the data in RSS/XML – https://weworkremotely.com/remote-job-rss-feed
You don’t need to scrape the HTML; just hit the RSS feeds and digest the data from there. Make sure you control how often you hit their server — requests in rapid succession will get you blocked. Poll the feeds 3–5 times daily and you should be fine.
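The approach above can be sketched with the standard library alone. This is a minimal example of digesting an RSS 2.0 feed; the sample XML below is illustrative (in practice you would fetch the feed URL above and pass `response.text` to the parser), and the exact fields in the real feed may differ.

```python
import xml.etree.ElementTree as ET

# Illustrative sample of an RSS 2.0 job feed; real feed fields may vary.
sample_rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Remote Jobs</title>
    <item>
      <title>Acme Corp: Backend Engineer</title>
      <link>https://weworkremotely.com/remote-jobs/example</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def parse_jobs(rss_text):
    """Extract (title, link) pairs from an RSS 2.0 feed string."""
    root = ET.fromstring(rss_text)
    jobs = []
    for item in root.iter('item'):  # every <item> is one job posting
        title = item.findtext('title', default='N/A')
        link = item.findtext('link', default='N/A')
        jobs.append((title, link))
    return jobs

print(parse_jobs(sample_rss))
```

Because the feed is plain XML, there is no HTML parsing to break when the site redesigns its pages, which is another point in favor of RSS over scraping.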
To address a 403 Forbidden error when web scraping:
Use a Valid User-Agent: Set a common User-Agent header to mimic a browser.
Use Proxies: Rotate IP addresses using proxies to avoid IP blocking.
Respect robots.txt: Check and follow the website’s scraping rules.
Add Delays: Introduce delays between requests to mimic human behavior.
Handle JavaScript: Use tools like Selenium for websites with JavaScript-rendered content.
Source: ScrapingBee – How to Handle a 403 Forbidden Error in Web Scraping
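The header, delay, and robots.txt points above can be sketched as follows. This is an illustrative outline, not a guaranteed bypass: the User-Agent string, delay values, and retry counts are arbitrary choices, and `allowed_by_robots` needs network access to fetch the site's robots.txt.

```python
import random
import time
import urllib.robotparser
from urllib.parse import urlsplit

import requests

# Browser-like headers; the exact User-Agent string is an illustrative choice.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) '
                  'Gecko/20100101 Firefox/128.0',
    'Accept-Language': 'en-US,en;q=0.9',
}

def make_session():
    """Return a requests.Session that sends browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session

def allowed_by_robots(url, user_agent='*'):
    """Check the site's robots.txt before scraping (fetches it over the network)."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f'{parts.scheme}://{parts.netloc}/robots.txt')
    rp.read()
    return rp.can_fetch(user_agent, url)

def polite_get(session, url, max_retries=3, base_delay=2.0):
    """GET with a randomized delay and exponential backoff on 403/429 responses."""
    response = None
    for _ in range(max_retries):
        time.sleep(base_delay + random.uniform(0, 1))  # mimic human pacing
        response = session.get(url, timeout=10)
        if response.status_code not in (403, 429):
            break
        base_delay *= 2  # back off before retrying
    return response
```

Proxy rotation and JavaScript rendering (e.g. with Selenium) are deliberately left out here, since they add infrastructure; for many sites, polite pacing plus realistic headers — or an official feed like the RSS one above — is enough.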