
I am having an issue that I simply cannot explain. I am scraping kleinanzeigen.de using proxies, which works perfectly on my machine, but if I dockerize the application, or have anyone else run the code with the exact same Python version and libraries, it returns a 403 error. I know for a fact that the proxy is being used on every machine, since I can see the requests going out on the proxy dashboard. I have also tried adding several request headers, with no success.

Dockerfile:

FROM python:3.10.12-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN apt-get update -y && \
    apt-get install -y postgresql postgresql-contrib && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir -r requirements.txt && \
    rm -rf /root/.cache && \
    apt-get autoremove -y

# STACKOVERFLOW
ENV PYTHONUNBUFFERED=1

CMD ["python", "main.py"]

Python code fragment:

def request_with_proxy(url, headers=None):
    # Copy the caller's headers (a shared mutable default would leak between calls)
    # and add a random user agent
    headers = dict(headers or {})
    headers["User-Agent"] = user_agent_rotator.get_random_user_agent()
    # Configure proxy
    try:
        proxy_url = f'http://{os.environ["PROXY_USER"]}:{os.environ["PROXY_PASSWORD"]}@p.webshare.io:80'
        proxies = {
            'http': proxy_url,
            'https': proxy_url
        }
    except KeyError:
        raise TypeError("MISSING PROXY ENVIRONMENT VARIABLES PROXY_USER AND PROXY_PASSWORD")

    # Retry 3 times before crashing
    for _ in range(ATTEMPTS):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
            print(response.status_code)
            return response
        except Exception as e:
            print(e)
    raise RuntimeError(f"All {ATTEMPTS} attempts to fetch {url} failed")

Note that I have removed the environment variables and I’m only showing a fragment from the code. Thank you so much!
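Incidentally, a dict default like `headers={}` is created once at function definition time and shared across calls, so the injected `User-Agent` can leak between requests. The usual idiom is a `None` default plus a per-call copy. A minimal sketch of the difference (the function names here are illustrative, not from the code above):

```python
def bad(url, headers={}):
    # The same dict object is reused on every call that omits `headers`.
    headers["User-Agent"] = "agent-for-" + url
    return headers

def good(url, headers=None):
    # Copy (or create) the dict per call so callers and later calls are unaffected.
    headers = dict(headers or {})
    headers["User-Agent"] = "agent-for-" + url
    return headers

a = bad("first")
b = bad("second")
print(a is b)    # True: both calls mutated one shared dict
c = good("first")
d = good("second")
print(c is d)    # False: each call gets its own dict
```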

2 Answers


  1. This is the setup I needed to reproduce your issue:

    🗎 Dockerfile

    FROM python:3.10.12-slim
    
    WORKDIR /app
    
    COPY requirements.txt /app
    
    RUN apt-get update -y && \
        apt-get install -y postgresql postgresql-contrib && \
        rm -rf /var/lib/apt/lists/* && \
        pip install --no-cache-dir -r requirements.txt && \
        rm -rf /root/.cache && \
        apt-get autoremove -y
    
    ENV PYTHONUNBUFFERED=1
    
    COPY main.py /app
    # NOTE: You should not copy the .env file into the image. This is for illustration only!
    COPY .env /app
    
    CMD ["python", "main.py"]
    

    🗎 requirements.txt

    python-dotenv==1.0.1
    random-user-agent==1.0.1
    requests==2.31.0
    

    🗎 main.py

    import os
    import base64
    import requests
    from random_user_agent.user_agent import UserAgent
    from random_user_agent.params import SoftwareName, OperatingSystem
    from dotenv import load_dotenv
    
    load_dotenv()
    #
    # The .env file contains definitions for the following environment variables:
    #
    PROXY_USER = os.environ["PROXY_USER"]
    PROXY_PWRD = os.environ["PROXY_PWRD"]
    PROXY_HOST = os.environ["PROXY_HOST"]
    PROXY_PORT = os.environ["PROXY_PORT"]
    #
    # The host and port are from the proxy list on webshare.io.
    
    software_names = [SoftwareName.CHROME.value]
    operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value] 
    
    user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
    
    ATTEMPTS = 3
    TIMEOUT = 30
    
    def request_with_proxy(url, headers=None):
        headers = dict(headers or {})  # avoid sharing a mutable default between calls
        headers["User-Agent"] = user_agent_rotator.get_random_user_agent()
        try:
            proxy_url = f'http://{PROXY_USER}:{PROXY_PWRD}@{PROXY_HOST}:{PROXY_PORT}'
            proxies = {
                'http': proxy_url,
                'https': proxy_url
            }
        except KeyError:
            raise TypeError("Missing environment variables!")
    
        print(url)
        print(headers)
    
        for _ in range(ATTEMPTS):
            try:
                response = requests.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
                print(response.status_code)
                return response
            except Exception as e:
                print(e)
    
    if __name__ == "__main__":
        request_with_proxy("https://github.com/davidteather/everything-web-scraping/stargazers")
        request_with_proxy("https://kleinanzeigen.de/")
    

    However, with this setup I got a 403 error from https://kleinanzeigen.de/ regardless of whether I was running inside a Docker container or directly on the host. Are you sure it succeeds when run directly from your host machine?

    As a control I retrieved another URL (https://github.com/davidteather/everything-web-scraping/stargazers) via the proxy and it worked fine both directly from the host and using Docker.
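    Another sanity check worth running on both machines is to log which proxy variables the process actually sees before any request goes out, without echoing the password. The helper below is my own sketch, not part of the original code; the variable names are taken from the question's fragment:

```python
import os

def proxy_config_summary(environ=None):
    """Summarise proxy-related variables without leaking the password."""
    environ = os.environ if environ is None else environ
    user = environ.get("PROXY_USER")
    pwrd = environ.get("PROXY_PASSWORD")
    return {
        "PROXY_USER": user if user else "<missing>",
        "PROXY_PASSWORD": "<set>" if pwrd else "<missing>",
    }

# Example with a fake environment: the password is reported but never printed.
print(proxy_config_summary({"PROXY_USER": "alice", "PROXY_PASSWORD": "secret"}))
# → {'PROXY_USER': 'alice', 'PROXY_PASSWORD': '<set>'}
```

    If the summary differs between the host and the container, the 403 may simply be an unauthenticated direct request rather than a blocked proxy.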

    Screenshot below shows running via Docker and from host.


  2. A 403 error appears when the website is blocking your request. There are two common reasons:

    1. The proxy's IP is simply blocked by the website.
    2. An anti-bot system detects that you are not a real user and blocks the request.

    If your proxy is not banned on this website, you need to try tools that render and execute JavaScript, such as Selenium or similar. This is a common problem, and in most cases rendering the page the way a real browser would helps to avoid anti-bot systems.
    https://pypi.org/project/selenium/
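
    A sketch of that fallback structure, with `render_with_browser` standing in for a Selenium-based fetch. Both callables here are placeholders of my own, injected so the control flow runs without a network or a browser:

```python
def fetch_with_fallback(url, plain_get, render_with_browser):
    """Try a plain HTTP fetch first; fall back to a browser renderer on a 403.

    `plain_get` and `render_with_browser` are injected callables returning
    (status_code, body) so the flow can be demonstrated with stubs.
    """
    status, body = plain_get(url)
    if status != 403:
        return status, body
    # Anti-bot systems often pass requests that come from a real, JS-executing browser.
    return render_with_browser(url)

# Stub demonstration: the plain client is blocked, the renderer succeeds.
blocked = lambda url: (403, "")
renderer = lambda url: (200, "<html>rendered</html>")
print(fetch_with_fallback("https://kleinanzeigen.de/", blocked, renderer))
# → (200, '<html>rendered</html>')
```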
