
I am trying to create a Node.js API that scrapes websites (I started with Goodreads as the only site to be scraped, and will expand further once I optimize the approach) and provides the scraped data to the end users of my API.

My initial approach was to plan the API structure, decide to use Puppeteer, and then start building. After successfully creating the first endpoint, I noticed something: the request took about 2-3 seconds to finish in Postman, which is really slow.

Here is my code:

scraper-handler.ts

import { NextFunction, Request, Response } from "express";
import { MOST_POPULAR_LISTS } from "../utils/api/urls-endpoints.js";
import { listScraper } from "./spec-scrapers/list-scraper.js";
import { lists } from "../utils/api/full-urls.js";
import puppeteer from "puppeteer";
import { GOODREADS_POPULAR_LISTS_URL } from "../utils/goodreads/urls.js";

export const scraperHandler = async (
  req: Request,
  res: Response,
  next: NextFunction
) => {
  const browser = await puppeteer.launch({
    // headless: false,
    // defaultViewport: null,
  });

  const pages = await browser.pages();

  await pages[0].setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  );

  switch (req.url) {
    case `/${MOST_POPULAR_LISTS}`: {
      const result = await listScraper(
        browser,
        pages[0],
        GOODREADS_POPULAR_LISTS_URL,
        1,
        ".cell",
        ".listTitle",
        ".listTitle"
      );

      res.status(200).json({
        status: "success",
        data: result,
      });
      break;
    }
    default: {
      next();
      break;
    }
  }
};

And here is the scraper called in the /${MOST_POPULAR_LISTS} case:

list-scraper.ts

import puppeteer, { Page } from "puppeteer";
import { Browser } from "puppeteer";

export const listScraper = async (
  browser: Browser,
  page: Page,
  url: string,
  pageI = 1,
  main: string,
  title = "",
  ref = ""
)  => {
  // const page = await browser.newPage();

  await page.goto(url, {
    waitUntil: "domcontentloaded",
  });
  
  const books = await page.evaluate(
    (mainSelector, titleSelector, refSelector) => {
      // const nextLink = document.querySelector('a[rel="next"]');

      // console.log(nextLink);
      const elements = document.querySelectorAll(mainSelector);
      
      return Array.from(elements)
        .slice(0, 3)
        .map((element) => {
          const title =
            titleSelector.length > 0 &&
            (element.querySelector(titleSelector) as HTMLElement | null)
              ?.innerText;
          const ref =
            refSelector.length > 0 &&
            (element.querySelector(refSelector) as HTMLAnchorElement | null)
              ?.href;

          return { title, ref };
        });
    },
    main,
    title,
    ref
  );
  // await page.click(".pagination > a");

  await browser.close();

  return books;
};

Here is the Goodreads URL I started from:

export const GOODREADS_POPULAR_LISTS_URL = "https://www.goodreads.com/list/popular_lists";

So my question is: how can I optimize my approach, and what techniques can I use to make the scraping faster and thus drastically improve the performance of my API?

I searched various posts, and many suggest some kind of CPU manipulation, but I didn't understand how that could be applied in my case. Child processes in Node.js were also suggested quite a few times.

Thank you in advance!

2 Answers


  1. There are a few things that can be looked at to improve the speed of this.

    1. The first major pain point I see is that a new Puppeteer instance is launched on each request. We can create a module following the singleton pattern:
    // browser-instance.ts
    import puppeteer, { Browser } from 'puppeteer';
    
    let browserInstance: Browser | null = null;
    
    export async function getBrowser(): Promise<Browser> {
        if (!browserInstance) {
            browserInstance = await puppeteer.launch({ ... });
        }
        return browserInstance;
    }
    

    Then, in your code, get the shared instance:

    import { getBrowser } from './browser-instance.js';
    // ...
    export const scraperHandler = async (req: Request, res: Response, next: NextFunction) => {
        const browser = await getBrowser();
        // ... use the browser instance for scraping
    };
    
    2. Consider using Promise.all to run the scrapes for multiple pages concurrently and wait for them all to finish:
    const results = await Promise.all(listUrls.map(async (url) => {
        const page = await browser.newPage();
        try {
            // ... scraping logic for each page ...
        } finally {
            await page.close(); // always release the tab, even if scraping throws
        }
    }));
    
    3. Another approach is to cache pages you've already hit if the data doesn't change much; a minimal sketch of such a cache follows below.

    There are a lot more optimisations you can make. Consider profiling your application to find the slow points, then focus your research on how to optimise those.
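
    As a rough sketch of point 3, a tiny in-memory TTL cache could look like the module below (the getOrScrape helper, the cache key, and the one-hour TTL are illustrative assumptions, not part of the original code):

    // cache.ts -- minimal in-memory TTL cache (illustrative sketch)
    type CacheEntry<T> = { data: T; expiresAt: number };
    
    const cache = new Map<string, CacheEntry<unknown>>();
    const TTL_MS = 60 * 60 * 1000; // keep scraped data for up to an hour
    
    export async function getOrScrape<T>(key: string, scrape: () => Promise<T>): Promise<T> {
        const hit = cache.get(key) as CacheEntry<T> | undefined;
        if (hit && hit.expiresAt > Date.now()) {
            return hit.data; // cache hit: no browser work at all
        }
        const data = await scrape(); // e.g. () => listScraper(browser, page, url, ...)
        cache.set(key, { data, expiresAt: Date.now() + TTL_MS });
        return data;
    }

    In the handler, the scrape would then be wrapped as getOrScrape(req.url, () => listScraper(...)), so repeated requests within the hour never touch the browser.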

  2. Your approach seems like overkill to me. The data you want is in the static HTML, so a single request to the page should be enough to get the data, using Node's built-in fetch API and a simple static HTML parser, Cheerio:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const url = "<Your URL>";
    
    fetch(url)
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
        const data = [...$(".listTitle")].map(e => ({
          title: $(e).text(),
          ref: $(e).attr("href"),
        }));
        console.log(data);
      })
      .catch(err => console.error(err));
    

    For the sake of comparison, here's an optimized Puppeteer version:

    const puppeteer = require("puppeteer"); // ^22.6.0
    
    const url = "<Your URL>";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.setJavaScriptEnabled(false);
      const ua =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
      await page.setUserAgent(ua);
      await page.setRequestInterception(true);
      page.on("request", req => {
        if (req.url() === url) {
          req.continue();
        } else {
          req.abort();
        }
      });
      await page.goto(url, {waitUntil: "domcontentloaded"});
      const data = await page.$$eval(".listTitle", els => els.map(el => ({
        title: el.textContent,
        ref: el.href,
      })));
      console.log(data);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Here's (basically) your version, unoptimized other than "domcontentloaded", which is a good start:

    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const ua =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
      await page.setUserAgent(ua);
      await page.goto(url, {waitUntil: "domcontentloaded"});
      const data = await page.$$eval(".listTitle", els => els.map(el => ({
        title: el.textContent,
        ref: el.href,
      })));
      console.log(data);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Unoptimized Puppeteer:

    real 0m1.806s
    user 0m0.801s
    sys  0m0.204s
    

    Optimized Puppeteer:

    real 0m1.251s
    user 0m0.651s
    sys  0m0.107s
    

    Fetch + Cheerio:

    real 0m0.836s
    user 0m0.251s
    sys  0m0.035s
    

    Using fetch and Cheerio is over a 2x speedup, with simpler code and no dependency on a complex library like Puppeteer. The main reasons to use Puppeteer for scraping are when you need to perform complex interactions like clicks, intercept requests, work with cookies, avoid blocks, or wait for SPAs to load (some of those cases may apply here, but first add a user agent to the fetch request if you're getting blocked).
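
    If plain fetch does get blocked, the user agent goes into the headers option; a minimal sketch (the header value is just an example string, not a requirement):

    // Same fetch as above, but sending a browser-like User-Agent header
    fetch(url, {
      headers: {
        "User-Agent":
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
      },
    })
      .then(res => res.text())
      .then(html => { /* parse with Cheerio as above */ })
      .catch(err => console.error(err));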

    Regardless of the scraping approach you take, caching the response is a good idea. I doubt the data changes that much, so you can re-fetch only every hour or so, possibly in a background job, and serve it instantly to everyone from the cache via your proxy.
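
    A hedged sketch of that background-job idea (scrapePopularLists, the route path, and the hour-long interval are placeholder assumptions; plug in whichever scraping function you end up with):

    import express from "express";
    
    const app = express();
    let cachedLists: unknown = null;
    
    // Placeholder: the Puppeteer listScraper above, or the fetch + Cheerio version.
    async function scrapePopularLists(): Promise<unknown> {
      const res = await fetch("https://www.goodreads.com/list/popular_lists");
      const html = await res.text();
      return html; // parse with Cheerio here and return the list data
    }
    
    async function refreshCache() {
      cachedLists = await scrapePopularLists();
    }
    
    refreshCache().catch(err => console.error(err)); // warm the cache at startup
    setInterval(() => refreshCache().catch(err => console.error(err)), 60 * 60 * 1000); // then roughly every hour
    
    app.get("/most-popular-lists", (_req, res) => {
      // served straight from memory; no scraping happens on the request path
      res.status(200).json({ status: "success", data: cachedLists });
    });
    
    app.listen(3000);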

    See this blog post of mine for details on the techniques I used to speed this script up.
