
I have built a scraper using Puppeteer and Node.js, and now I want to dockerize it. I’ve tried several approaches, but each one fails when Puppeteer tries to start the browser for scraping.

My current basic Dockerfile, without Puppeteer or any other dependencies, is below.
I’ve tried updating it in several ways (adding Chrome, adding Puppeteer), but none of them worked.

# Use Node.js runtime as the base image
FROM node:18

# Set the working directory in the container
WORKDIR /usr/src/app

# Copy package.json and package-lock.json to the working directory
COPY package*.json ./

# Install dependencies
RUN npm install

# Copy the rest of the application code
COPY . .

# Expose the port the app runs on
EXPOSE 8080

# Command to run the application
CMD ["node", "scraper.js"]

Code:
Snippet that launches the browser:

// Launch browser
const browser = await launch({ headless: true, defaultViewport: null });

Can someone help me get this working?

I have tried every approach from here, here and here.

Encountered Error :

An error occurred during scraping:

Error: Failed to launch the browser process!
web-crawler-1  | rosetta error: failed to open elf at /lib64/ld-linux-x86-64.so.2
web-crawler-1  |  
web-crawler-1  | 
web-crawler-1  | 
web-crawler-1  | TROUBLESHOOTING: https://pptr.dev/troubleshooting
web-crawler-1  | 
web-crawler-1  |     at Interface.onClose (file:///usr/src/app/node_modules/@puppeteer/browsers/lib/esm/launch.js:301:24)
web-crawler-1  |     at Interface.emit (node:events:529:35)
web-crawler-1  |     at Interface.close (node:internal/readline/interface:534:10)
web-crawler-1  |     at Socket.onend (node:internal/readline/interface:260:10)
web-crawler-1  |     at Socket.emit (node:events:529:35)
web-crawler-1  |     at endReadableNT (node:internal/streams/readable:1400:12)
web-crawler-1  |     at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

2 Answers


  1. Chosen as BEST ANSWER

    This solution worked for me.

    To run Puppeteer inside a Docker container, install Google Chrome manually: unlike the Chromium package offered by Debian, Google's repository always provides the latest stable release.

    Install the browser in the Dockerfile:

    FROM node:18
    
    # We don't need the standalone Chromium
    ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
    
    # Install Google Chrome Stable from Google's apt repository
    # Note: this installs the necessary libs to make the browser work with Puppeteer.
    RUN apt-get update && apt-get install curl gnupg -y \
      && curl --location --silent https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
      && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
      && apt-get update \
      && apt-get install google-chrome-stable -y --no-install-recommends \
      && rm -rf /var/lib/apt/lists/*
    
    # Install your app here...
    

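    Additionally, if you are on an ARM-based CPU (such as Apple M1) like me, pass the --platform linux/amd64 argument when you build the Docker image. The Chrome package installed above is an x86-64 build, which is why the arm64 container fails with the rosetta error about /lib64/ld-linux-x86-64.so.2.

    Build command: docker build --platform linux/amd64 -t <image-name> .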
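
To make that failure easier to recognize at runtime, here is a minimal sketch of a check that could wrap the launch call. The helper name explainLaunchFailure is hypothetical and not part of Puppeteer; it only looks for the x86-64 loader path from the error in the question and compares it with process.arch, on the assumption that this combination indicates an architecture mismatch.

```javascript
// Hypothetical helper (not a Puppeteer API): maps the launch error from the
// question to a hint about an architecture mismatch.
function explainLaunchFailure(message, arch = process.arch) {
  // The x86-64 dynamic loader path appears in the rosetta error when an
  // x86-64 browser binary is started inside an arm64 container.
  const mentionsX64Loader = message.includes('ld-linux-x86-64.so.2');
  if (mentionsX64Loader && arch === 'arm64') {
    return 'Browser binary is x86-64 but the container is arm64: ' +
      'rebuild with --platform linux/amd64.';
  }
  return null; // No hint for other errors.
}

// Usage sketch (assumes `launch` is imported from puppeteer as in the question):
// try {
//   const browser = await launch({ headless: true });
// } catch (err) {
//   console.error(explainLaunchFailure(String(err.message)) ?? err);
// }
```

The hint is only a heuristic; other launch failures still need the Puppeteer troubleshooting page linked in the error output.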

    Note: After updating your Dockerfile, also update the Puppeteer script. When launching the browser, set executablePath to the path of the Chrome binary installed above.

    const browser = await launch({
       headless: true,
       defaultViewport: null,
       executablePath: '/usr/bin/google-chrome',
       args: ['--no-sandbox'],
    });
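
If the same script also has to run outside Docker, hard-coding the path can get in the way. One possible approach, sketched below, is to build the launch options from environment variables. The helper buildLaunchOptions and the variable names CHROME_PATH and DISABLE_SANDBOX are assumptions made for this example, not Puppeteer conventions.

```javascript
// Hypothetical helper: builds Puppeteer launch options from the environment.
// CHROME_PATH and DISABLE_SANDBOX are names invented for this sketch.
function buildLaunchOptions(env = process.env) {
  const options = { headless: true, defaultViewport: null };
  if (env.CHROME_PATH) {
    // Use the Chrome installed in the image, e.g. /usr/bin/google-chrome.
    options.executablePath = env.CHROME_PATH;
  }
  if (env.DISABLE_SANDBOX === 'true') {
    // Needed when Chrome runs as root inside the container.
    options.args = ['--no-sandbox'];
  }
  return options;
}

// Usage sketch (assumes `launch` is imported from puppeteer as in the question):
// const browser = await launch(buildLaunchOptions());
console.log(buildLaunchOptions({
  CHROME_PATH: '/usr/bin/google-chrome',
  DISABLE_SANDBOX: 'true',
}));
```

With this in place, the Dockerfile would set both variables with ENV, while a local run without them falls back to Puppeteer's defaults.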
    

  2. Parv’s solution worked for me in local Docker but not in an Azure Kubernetes Service (AKS) cluster.

    This is my final solution:

    1. Use the official Puppeteer Docker image: ghcr.io/puppeteer/puppeteer
    2. Set the environment variables XDG_CONFIG_HOME=/tmp/.chromium and XDG_CACHE_HOME=/tmp/.chromium
      Thanks to https://github.com/puppeteer/puppeteer/issues/11023#issuecomment-1776247197

    My Dockerfile:

    FROM ghcr.io/puppeteer/puppeteer:22
    
    USER root
    
    # Create directories owned by pptruser (the image's non-root user), so we don't need --no-sandbox.
    RUN mkdir -p /home/pptruser/Downloads /app \
        && chown -R pptruser:pptruser /home/pptruser \
        && chown -R pptruser:pptruser /app
    
    # Run everything after as non-privileged user.
    USER pptruser
    
    # Install the app's dependencies under /app/node_modules
    COPY package.json /app/
    RUN cd /app/ && npm install
    COPY myscript.js /app/
    
    ENTRYPOINT ["/usr/local/bin/node", "/app/myscript.js"]
    

    Snippet of the Kubernetes deployment YAML:

          env:
            - name: XDG_CONFIG_HOME
              value: /tmp/.chromium
            - name: XDG_CACHE_HOME
              value: /tmp/.chromium
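
When the deployment manifest cannot be changed, the same two variables could in principle be defaulted from the script instead. The sketch below assumes that passing them through the env option of launch has the same effect as setting them on the pod, which I have not verified on AKS. The helper name withChromiumDirs is hypothetical.

```javascript
// Hypothetical helper: returns a copy of the environment with the XDG
// directories defaulted to a writable location, without overriding
// values that the deployment already set.
function withChromiumDirs(env = process.env, dir = '/tmp/.chromium') {
  return {
    ...env,
    XDG_CONFIG_HOME: env.XDG_CONFIG_HOME || dir,
    XDG_CACHE_HOME: env.XDG_CACHE_HOME || dir,
  };
}

// Usage sketch (assumes `launch` is imported from puppeteer):
// const browser = await launch({ headless: true, env: withChromiumDirs() });
const example = withChromiumDirs({ XDG_CACHE_HOME: '/var/cache/custom' });
console.log(example.XDG_CONFIG_HOME, example.XDG_CACHE_HOME);
```

Setting the variables in the Kubernetes manifest, as shown above, remains the approach that was confirmed to work.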
    