
I am having an issue that I simply cannot explain. I am scraping kleinanzeigen.de using proxies, which works perfectly on my machine, but if I dockerize the application, or have anyone else run the code with the exact same Python version and libraries, it returns a 403 error. I know for a fact that the proxy is being used on every machine, since I can see the requests going out on the proxy dashboard. I have also tried adding several request headers, with no success.

Dockerfile:

FROM python:3.10.12-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN apt-get update -y && \
    apt-get install -y postgresql postgresql-contrib && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir -r requirements.txt && \
    rm -rf /root/.cache && \
    apt-get autoremove -y

# STACKOVERFLOW
ENV PYTHONUNBUFFERED=1

CMD ["python", "main.py"]

Python code fragment:

def request_with_proxy(url, headers=None):
    # Copy the caller's headers (a shared mutable default would leak between calls)
    # and add a random user agent
    headers = dict(headers or {})
    headers["User-Agent"] = user_agent_rotator.get_random_user_agent()
    # Configure proxy
    try:
        proxy_url = f'http://{os.environ["PROXY_USER"]}:{os.environ["PROXY_PASSWORD"]}@p.webshare.io:80'
        proxies = {
            'http': proxy_url,
            'https': proxy_url
        }
    except KeyError:
        raise TypeError("MISSING PROXY ENVIRONMENT VARIABLES PROXY_USER AND PROXY_PASSWORD")

    # Retry 3 times before crashing
    for _ in range(ATTEMPTS):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
            print(response.status_code)
            return response
        except Exception as e:
            print(e)
    raise RuntimeError(f"All {ATTEMPTS} attempts to fetch {url} failed")

Note that I have removed the environment variables and I’m only showing a fragment from the code. Thank you so much!
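Incidentally, a dict default like `headers={}` is created once at function definition time and shared across calls, so the injected `User-Agent` can leak between requests. The usual idiom is a `None` default plus a per-call copy. A minimal sketch of the difference (the function names here are illustrative, not from the code above):

```python
def bad(url, headers={}):
    # The same dict object is reused on every call that omits `headers`.
    headers["User-Agent"] = "agent-for-" + url
    return headers

def good(url, headers=None):
    # Copy (or create) the dict per call so callers and later calls are unaffected.
    headers = dict(headers or {})
    headers["User-Agent"] = "agent-for-" + url
    return headers

a = bad("first")
b = bad("second")
print(a is b)    # True: both calls mutated one shared dict
c = good("first")
d = good("second")
print(c is d)    # False: each call gets its own dict
```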

2 Answers


  1. This is the setup I needed to reproduce your issue:

    🗎 Dockerfile

    FROM python:3.10.12-slim
    
    WORKDIR /app
    
    COPY requirements.txt /app
    
    RUN apt-get update -y && \
        apt-get install -y postgresql postgresql-contrib && \
        rm -rf /var/lib/apt/lists/* && \
        pip install --no-cache-dir -r requirements.txt && \
        rm -rf /root/.cache && \
        apt-get autoremove -y
    
    ENV PYTHONUNBUFFERED=1
    
    COPY main.py /app
    # NOTE: You should not copy the .env file into the image. This is for illustration only!
    COPY .env /app
    
    CMD ["python", "main.py"]
    

    🗎 requirements.txt

    python-dotenv==1.0.1
    random-user-agent==1.0.1
    requests==2.31.0
    

    🗎 main.py

    import os
    import base64
    import requests
    from random_user_agent.user_agent import UserAgent
    from random_user_agent.params import SoftwareName, OperatingSystem
    from dotenv import load_dotenv
    
    load_dotenv()
    #
    # The .env file contains definitions for the following environment variables:
    #
    PROXY_USER = os.environ["PROXY_USER"]
    PROXY_PWRD = os.environ["PROXY_PWRD"]
    PROXY_HOST = os.environ["PROXY_HOST"]
    PROXY_PORT = os.environ["PROXY_PORT"]
    #
    # The host and port are from the proxy list on webshare.io.
    
    software_names = [SoftwareName.CHROME.value]
    operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value] 
    
    user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
    
    ATTEMPTS = 3
    TIMEOUT = 30
    
    def request_with_proxy(url, headers=None):
        headers = dict(headers or {})  # avoid sharing a mutable default between calls
        headers["User-Agent"] = user_agent_rotator.get_random_user_agent()
        try:
            proxy_url = f'http://{PROXY_USER}:{PROXY_PWRD}@{PROXY_HOST}:{PROXY_PORT}'
            proxies = {
                'http': proxy_url,
                'https': proxy_url
            }
        except KeyError:
            raise TypeError("Missing environment variables!")
    
        print(url)
        print(headers)
    
        for _ in range(ATTEMPTS):
            try:
                response = requests.get(url, headers=headers, proxies=proxies, timeout=TIMEOUT)
                print(response.status_code)
                return response
            except Exception as e:
                print(e)
    
    if __name__ == "__main__":
        request_with_proxy("https://github.com/davidteather/everything-web-scraping/stargazers")
        request_with_proxy("https://kleinanzeigen.de/")
    

    However, with this setup I got a 403 error from https://kleinanzeigen.de/ regardless of whether I was running inside a Docker container or directly on the host. Are you sure it succeeds when run directly from your host machine?

    As a control I retrieved another URL (https://github.com/davidteather/everything-web-scraping/stargazers) via the proxy and it worked fine both directly from the host and using Docker.
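    Another sanity check worth running on both machines is to log which proxy variables the process actually sees before any request goes out, without echoing the password. The helper below is my own sketch, not part of the original code; the variable names are taken from the question's fragment:

```python
import os

def proxy_config_summary(environ=None):
    """Summarise proxy-related variables without leaking the password."""
    environ = os.environ if environ is None else environ
    user = environ.get("PROXY_USER")
    pwrd = environ.get("PROXY_PASSWORD")
    return {
        "PROXY_USER": user if user else "<missing>",
        "PROXY_PASSWORD": "<set>" if pwrd else "<missing>",
    }

# Example with a fake environment: the password is reported but never printed.
print(proxy_config_summary({"PROXY_USER": "alice", "PROXY_PASSWORD": "secret"}))
# → {'PROXY_USER': 'alice', 'PROXY_PASSWORD': '<set>'}
```

    If the summary differs between the host and the container, the 403 may simply be an unauthenticated direct request rather than a blocked proxy.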

    Screenshot below shows running via Docker and from host.


  2. A 403 error appears when the website is blocking your request. There are two common reasons:

    1. The proxy's IP is simply blocked by the website.
    2. An anti-bot system detects that you are not a real user and blocks the request.

    If your proxy is not banned on this website, you need to try tools that render and execute JavaScript, such as Selenium or similar. This is a common problem, and in most cases rendering the page the way a real browser would helps to avoid anti-bot systems.
    https://pypi.org/project/selenium/
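
    A sketch of that fallback structure, with `render_with_browser` standing in for a Selenium-based fetch. Both callables here are placeholders of my own, injected so the control flow runs without a network or a browser:

```python
def fetch_with_fallback(url, plain_get, render_with_browser):
    """Try a plain HTTP fetch first; fall back to a browser renderer on a 403.

    `plain_get` and `render_with_browser` are injected callables returning
    (status_code, body) so the flow can be demonstrated with stubs.
    """
    status, body = plain_get(url)
    if status != 403:
        return status, body
    # Anti-bot systems often pass requests that come from a real, JS-executing browser.
    return render_with_browser(url)

# Stub demonstration: the plain client is blocked, the renderer succeeds.
blocked = lambda url: (403, "")
renderer = lambda url: (200, "<html>rendered</html>")
print(fetch_with_fallback("https://kleinanzeigen.de/", blocked, renderer))
# → (200, '<html>rendered</html>')
```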
