I am having an issue that I simply cannot explain. I am scraping kleinanzeigen.de through proxies, which works perfectly on my machine, but if I dockerize the application or have anyone else run the code with the exact same versions and libraries, it returns a 403 error. I know for a fact that the proxy is being used on every machine, since I can see the requests going out on the proxy dashboard. I have also tried adding several request headers, with no success.
Dockerfile:
FROM python:3.10.12-slim
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN apt-get update -y && \
    apt-get install -y postgresql postgresql-contrib && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir -r requirements.txt && \
    rm -rf /root/.cache && \
    apt-get autoremove -y
# STACKOVERFLOW
ENV PYTHONUNBUFFERED=1
CMD ["python", "main.py"]
Python code fragment:
import os
import requests

def request_with_proxy(url, headers=None):
    # Copy the headers so neither the caller's dict nor a shared default is mutated
    headers = dict(headers or {})
    # Add random user agent to headers
    headers["User-Agent"] = user_agent_rotator.get_random_user_agent()
    # Configure proxy
    try:
        proxy_url = f'http://{os.environ["PROXY_USER"]}:{os.environ["PROXY_PASSWORD"]}@p.webshare.io:80'
        proxies = {
            'http': proxy_url,
            'https': proxy_url
        }
    except KeyError:
        raise TypeError("MISSING PROXY ENVIRONMENT VARIABLES PROXY_USER AND PROXY_PASSWORD")
    # Retry up to ATTEMPTS times before giving up
    for _ in range(ATTEMPTS):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
            print(response)
            print(response.status_code)
            return response
        except Exception as E:
            print(E)
Note that I have removed the environment variables and I’m only showing a fragment from the code. Thank you so much!
2 Answers
This is the setup I needed to reproduce your issue: a Dockerfile, a requirements.txt, and a main.py.
However, with this setup I got a 403 error from https://kleinanzeigen.de/ regardless of whether I was running from a Docker container or directly from the host. Are you sure that it works when run directly on your host machine?
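If it does, one way to narrow down the difference is to print the same environment report on the host and inside the container, then compare the two outputs. The sketch below is only a guess at what might be worth comparing (interpreter, OpenSSL build, and the HTTP-related packages); `environment_report` and the `PACKAGES` list are hypothetical names, not part of the code in the question.

```python
import json
import platform
import ssl
import sys
from importlib import metadata

# Packages assumed (not confirmed) to influence how the request looks to the server.
PACKAGES = ("requests", "urllib3", "certifi")


def environment_report(packages=PACKAGES):
    """Collect version details that can differ between host and container."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "openssl": ssl.OPENSSL_VERSION,
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Run this on the host and in the container, then diff the two outputs.
    print(json.dumps(environment_report(), indent=2, sort_keys=True))
```

Identical reports would suggest the difference lies outside the Python environment, for example in which proxy exit IP each machine happens to get.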
As a control I retrieved another URL (https://github.com/davidteather/everything-web-scraping/stargazers) via the proxy and it worked fine both directly from the host and using Docker.
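A control check like this can be sketched roughly as follows. The helper names `fetch_status` and `compare_targets` and the 10-second timeout are assumptions for illustration; the proxy settings mirror the fragment in the question.

```python
def fetch_status(url, proxies, timeout=10):
    """Return the HTTP status code for url when fetched through the proxy."""
    import requests  # imported lazily so compare_targets can be used with a stub

    return requests.get(url, proxies=proxies, timeout=timeout).status_code


def compare_targets(urls, proxies, fetch=fetch_status):
    """Fetch each URL through the same proxy and map URL -> status code."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url, proxies)
        except Exception as exc:  # network errors are recorded, not raised
            results[url] = f"error: {exc}"
    return results


# Example usage (needs network access and the proxy credentials):
#
#   import os
#   proxy_url = f'http://{os.environ["PROXY_USER"]}:{os.environ["PROXY_PASSWORD"]}@p.webshare.io:80'
#   proxies = {"http": proxy_url, "https": proxy_url}
#   urls = [
#       "https://kleinanzeigen.de/",
#       "https://github.com/davidteather/everything-web-scraping/stargazers",
#   ]
#   for url, status in compare_targets(urls, proxies).items():
#       print(status, url)
```

If the control URL succeeds while the target returns 403 in both environments, the block is specific to the target site rather than a failure of the proxy itself.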
The screenshot below shows the script running via Docker and from the host.
A 403 error appears when the website is blocking your request. There are two possible reasons: either your proxy is banned on this website, or its anti-bot system expects a client that renders and executes JavaScript.
If your proxy is not banned on this website, try a tool that renders and executes JavaScript, such as Selenium or something similar. This is a common problem, and in most cases rendering the page the way a real browser does helps to get past anti-bot systems.
https://pypi.org/project/selenium/
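A minimal sketch of that approach, assuming Selenium 4 and a local Chrome installation. The function names `build_chrome_arguments` and `fetch_rendered_html` are hypothetical. As far as I know, Chrome's `--proxy-server` flag does not accept a username and password, so this assumes a proxy that authorizes by IP address instead of credentials; there is also no guarantee that a rendered browser gets past the block.

```python
def build_chrome_arguments(proxy_host, proxy_port, headless=True):
    """Build the Chrome command-line arguments for a proxied browser session."""
    args = []
    if headless:
        args.append("--headless=new")  # headless mode of recent Chrome versions
    args.append(f"--proxy-server=http://{proxy_host}:{proxy_port}")
    return args


def fetch_rendered_html(url, proxy_host, proxy_port):
    """Load url in Chrome through the proxy and return the rendered page source."""
    from selenium import webdriver  # pip install selenium

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(proxy_host, proxy_port):
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Example usage (needs Chrome, Selenium, and network access):
#
#   html = fetch_rendered_html("https://kleinanzeigen.de/", "p.webshare.io", 80)
#   print(len(html))
```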