
I have a classic "it works on my machine" problem: a web scraper that runs successfully on my laptop fails with a persistent error whenever I try to run it in a container.

My minimal reproducible dockerized example consists of the following files:

requirements.txt:

selenium==4.23.1
pandas==2.2.2
pandas-gbq==0.22.0
tqdm==4.66.2

Dockerfile:

FROM selenium/standalone-chrome:latest

# Set the working directory in the container
WORKDIR /usr/src/app

# Copy your application files
COPY . .

# Install Python and pip
USER root
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv

# Create a virtual environment
RUN python3 -m venv /usr/src/app/venv

# Activate the virtual environment and install dependencies
RUN . /usr/src/app/venv/bin/activate && \
    pip install --no-cache-dir -r requirements.txt

# Switch back to the selenium user
USER seluser

# Set the entrypoint to activate the venv and run your script
CMD ["/bin/bash", "-c", "source /usr/src/app/venv/bin/activate && python -m scrape_ev_files"]

scrape_ev_files.py (slimmed down to just what’s needed to repro error):

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def init_driver(local_download_path):
    os.makedirs(local_download_path, exist_ok=True)

    # Set Chrome Options    
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--remote-debugging-port=9222")

    prefs = {
        "download.default_directory": local_download_path,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)

    # Set up the driver
    service = Service()

    chrome_options = Options()
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Set download behavior
    driver.execute_cdp_cmd("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": local_download_path
    })

    return driver

if __name__ == "__main__":
    # PARAMS
    ELECTION = '2024 MARCH 5TH DEMOCRATIC PRIMARY'
    ORIGIN_URL = "https://earlyvoting.texas-election.com/Elections/getElectionDetails.do"
    CSV_DL_DIR = "downloaded_files"

    # initialize the driver
    driver = init_driver(local_download_path=CSV_DL_DIR)

shell command to reproduce the error:

docker build -t my_scraper .  # (no error)
docker run --rm -t my_scraper # (error)

The stack trace from the error is below. Any help would be much appreciated! I’ve tried many iterations of my requirements.txt and Dockerfile to fix this, but the error keeps occurring at the same spot:

  File "/workspace/scrape_ev_files.py", line 110, in <module>
    driver = init_driver(local_download_path=CSV_DL_DIR)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/scrape_ev_files.py", line 47, in init_driver
    driver = webdriver.Chrome(service=service, options=chrome_options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
    super().__init__(
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/chromium/webdriver.py", line 66, in __init__
    super().__init__(command_executor=executor, options=options)
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 212, in __init__
    self.start_session(capabilities)
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 299, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 354, in execute
    self.error_handler.check_response(response)
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

2 Answers


  1. I’m not sure if this is the problem, but there’s certainly an issue with your Python code.

    import os
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    
    
    def init_driver(local_download_path):
        os.makedirs(local_download_path, exist_ok=True)
    
        # Set Chrome Options    
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--remote-debugging-port=9222")
    
        prefs = {
            "download.default_directory": local_download_path,
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
        }
        chrome_options.add_experimental_option("prefs", prefs)
    
        # Set up the driver
        service = Service()
    
        chrome_options = Options()
        driver = webdriver.Chrome(service=service, options=chrome_options)
    
        # Set download behavior
        driver.execute_cdp_cmd("Page.setDownloadBehavior", {
            "behavior": "allow",
            "downloadPath": local_download_path
        })
    
        return driver
    
    if __name__ == "__main__":
        # PARAMS
        ELECTION = '2024 MARCH 5TH DEMOCRATIC PRIMARY'
        ORIGIN_URL = "https://earlyvoting.texas-election.com/Elections/getElectionDetails.do"
        CSV_DL_DIR = "downloaded_files"
    
        # initialize the driver
        driver = init_driver(local_download_path=CSV_DL_DIR)
    

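    In this code, you create chrome_options twice. The second Options() call replaces the object you just configured with an empty one, so none of the flags or prefs set above are passed to Chrome:

        # Set Chrome Options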
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--remote-debugging-port=9222")
    
        prefs = {
            "download.default_directory": local_download_path,
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
        }
        chrome_options.add_experimental_option("prefs", prefs)
    
        # Set up the driver
        service = Service()
    
        chrome_options = Options() # REPEAT HERE
        driver = webdriver.Chrome(service=service, options=chrome_options)
    

    Again, I’m not sure this is the cause of the error, but with that line in place Chrome starts without --headless or --no-sandbox, so removing it is worth trying first.
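
One way to catch this kind of slip before Chrome is launched is to inspect the argument list on the options object right before it is handed to webdriver.Chrome. The sketch below is only an illustration: missing_flags and REQUIRED_FLAGS are hypothetical names made up for this example, not part of Selenium. The helper works on a plain list of strings, such as the one exposed by chrome_options.arguments.

```python
# Hypothetical helper, not part of Selenium: reports which flags are absent.
REQUIRED_FLAGS = ("--headless", "--no-sandbox", "--disable-dev-shm-usage")


def missing_flags(arguments, required=REQUIRED_FLAGS):
    """Return the required flags that do not appear in the argument list."""
    # Compare on the flag name only, so "--headless=new" counts as "--headless".
    present = {arg.split("=", 1)[0] for arg in arguments}
    return [flag for flag in required if flag not in present]


# A freshly created Options() carries no arguments at all, which is what the
# second assignment in the question ends up passing to Chrome.
print(missing_flags([]))

# A fully configured list reports nothing missing.
print(missing_flags(["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]))
```

In the scraper this could be called as missing_flags(chrome_options.arguments) just before the driver is created, raising an error if the result is not empty.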

  2. The error you’re encountering is commonly caused by issues when running Chrome in a Docker container. Below are short and long answers (long has more details).

    Short Answer

    To fix the SessionNotCreatedException error with Selenium Chrome in Docker:

    1. Use Correct Chrome Options:

      chrome_options.add_argument("--headless")
      chrome_options.add_argument("--no-sandbox")
      chrome_options.add_argument("--disable-dev-shm-usage")
      chrome_options.add_argument("--remote-debugging-port=9222")
      chrome_options.add_argument("--disable-gpu")
      
    2. Increase Shared Memory: Run the Docker container with increased shared memory.

      docker run --rm -t --shm-size=2g my_scraper
      
    3. Check Docker Resources: Ensure Docker has sufficient memory and CPU resources allocated, especially on Docker Desktop.

    4. Add Debugging Flags: Enable additional logging for more insights.

      chrome_options.add_argument("--enable-logging")
      chrome_options.add_argument("--v=1")
      

    These steps should help resolve issues with running Selenium Chrome in a Docker container.


    Long Answer

    The issue can usually be resolved by making sure all of Chrome’s dependencies are installed, configuring Chrome with options suited to a container, increasing shared memory, and enabling extra debug output.

    Solution Steps

    1. Ensure All Chrome Dependencies Are Installed

      The Docker image you’re using (selenium/standalone-chrome) already ships Chrome, ChromeDriver, and the libraries they depend on, so you shouldn’t need to install anything beyond the Python packages you’ve already included.

    2. Set Chrome Options Appropriately

      Ensure that your Chrome options are configured correctly for a Docker environment. You’re already using the following flags, which are good practices:

      • --headless: Run Chrome in headless mode (without a GUI).
      • --no-sandbox: Disable Chrome’s sandbox. This weakens security rather than improving it, but it is often necessary in Docker, where the sandbox frequently cannot be set up.
      • --disable-dev-shm-usage: Avoid using /dev/shm, which may have limited space in Docker containers.
      • --remote-debugging-port=9222: Pin the DevTools port. ChromeDriver talks to Chrome over this protocol, but it normally picks a free port on its own, so this flag is optional.

      Here’s a consolidated version of your init_driver function:

      import os
      
      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options
      
      def init_driver(local_download_path):
          os.makedirs(local_download_path, exist_ok=True)
      
          chrome_options = Options()
          chrome_options.add_argument("--headless")
          chrome_options.add_argument("--no-sandbox")
          chrome_options.add_argument("--disable-dev-shm-usage")
          chrome_options.add_argument("--remote-debugging-port=9222")
          chrome_options.add_argument("--disable-gpu")  # Recommended for headless mode
      
          prefs = {
              "download.default_directory": local_download_path,
              "download.prompt_for_download": False,
              "download.directory_upgrade": True,
              "safebrowsing.enabled": True
          }
          chrome_options.add_experimental_option("prefs", prefs)
      
          driver = webdriver.Chrome(options=chrome_options)
          driver.execute_cdp_cmd("Page.setDownloadBehavior", {
              "behavior": "allow",
              "downloadPath": local_download_path
          })
      
          return driver
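
The download directory in the question is a relative path ("downloaded_files"), and the container runs as seluser, which may not own the working directory. Chrome is commonly reported to want an absolute download path, so as a precaution (an assumption worth verifying, not a confirmed cause of this error) the path can be normalized and checked before it is put into prefs. prepare_download_dir is a hypothetical helper name used only for this sketch.

```python
import os


def prepare_download_dir(path):
    """Create the directory if needed and return its absolute path.

    Raises PermissionError if the current user cannot write to it.
    """
    abs_path = os.path.abspath(path)
    os.makedirs(abs_path, exist_ok=True)
    # Fail early with a clear message instead of a silent failed download.
    if not os.access(abs_path, os.W_OK):
        raise PermissionError(f"cannot write to {abs_path}")
    return abs_path
```

Inside init_driver, local_download_path = prepare_download_dir(local_download_path) would then replace the bare os.makedirs call.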
      
    3. Increase Shared Memory Allocation

      The --disable-dev-shm-usage flag tells Chrome to write its shared memory files to /tmp instead of /dev/shm, which defaults to only 64 MB in a Docker container. You can also address the limit directly by increasing the shared memory allocated to the container.

      Run your Docker container with a larger shared memory allocation.

      docker run --rm -t --shm-size=2g my_scraper
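
To confirm that the --shm-size setting actually took effect, the size of /dev/shm can be read from inside the container. The following is a minimal sketch using only the standard library; shm_report is a hypothetical name chosen for this example.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return (total_mib, free_mib) for the mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.free // mib


if __name__ == "__main__":
    report = shm_report()
    if report is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm: {report[0]} MiB total, {report[1]} MiB free")
```

Running this at the top of the scraper, before the driver is created, shows whether the container got the default allocation or the larger one requested on the command line.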
      
    4. Check for Docker-Specific Issues

      Ensure that Docker has adequate permissions and resources on your host machine, especially if you’re running on macOS or Windows with Docker Desktop. Sometimes, insufficient memory or CPU allocations to Docker can cause Chrome to crash.

    5. Review the Docker Image

      Double-check that the selenium/standalone-chrome image fits your use case. If the headless configuration fails, another image, such as selenium/standalone-chrome-debug (if that variant is still published for your Selenium version), might provide more insight.

    6. Log Additional Debug Information

      You can increase verbosity in Chrome by adding more debugging arguments, such as --enable-logging and --v=1, which might help you diagnose the issue further.

      chrome_options.add_argument("--enable-logging")
      chrome_options.add_argument("--v=1")
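
Because the exception only says that Chrome exited, it can also help to launch Chrome once without Selenium and read its own error output. The sketch below assumes the binary is reachable as google-chrome (the stack trace in the question points at /usr/bin/google-chrome); try_chrome is a hypothetical helper name, and the -1 return code for a timeout is a convention of this example only.

```python
import shutil
import subprocess


def try_chrome(binary="google-chrome", timeout=30):
    """Launch headless Chrome once, outside Selenium.

    Returns None if the binary is not found, otherwise a tuple
    (return_code, last_lines_of_stderr).
    """
    exe = shutil.which(binary)
    if exe is None:
        return None
    cmd = [
        exe,
        "--headless",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--dump-dom",  # print the page's DOM and exit
        "about:blank",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return -1, "timed out"
    # Keep only the tail of stderr, where startup errors usually appear.
    return proc.returncode, "\n".join(proc.stderr.splitlines()[-10:])
```

Calling print(try_chrome()) inside the container, as the same user the scraper runs as, shows whether Chrome can start at all with these flags before ChromeDriver is involved.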
      