I am trying to create a Node.js API that scrapes websites (I started with Goodreads as the first site to scrape and will expand further once I have optimized the approach) and provides the scraped data to the end users of my API.
My initial approach was to plan the API structure, decide on Puppeteer, and then start building. After getting the first endpoint working I noticed something: the request takes about 2-3 seconds to finish in Postman, which is really slow.
Here is my code:
scraper-handler.ts
import { NextFunction, Request, Response } from "express";
import { MOST_POPULAR_LISTS } from "../utils/api/urls-endpoints.js";
import { listScraper } from "./spec-scrapers/list-scraper.js";
import { lists } from "../utils/api/full-urls.js";
import puppeteer from "puppeteer";
import { GOODREADS_POPULAR_LISTS_URL } from "../utils/goodreads/urls.js";
export const scraperHandler = async (
  req: Request,
  res: Response,
  next: NextFunction
) => {
  const browser = await puppeteer.launch({
    // headless: false,
    // defaultViewport: null,
  });

  const pages = await browser.pages();

  await pages[0].setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  );

  switch (req.url) {
    case `/${MOST_POPULAR_LISTS}`: {
      const result = await listScraper(
        browser,
        pages[0],
        GOODREADS_POPULAR_LISTS_URL,
        1,
        ".cell",
        ".listTitle",
        ".listTitle"
      );

      res.status(200).json({
        status: "success",
        data: result,
      });
      break;
    }
    default: {
      next();
      break;
    }
  }
};
And here is the case /${MOST_POPULAR_LISTS}:
list-scraper.ts
import puppeteer, { Page } from "puppeteer";
import { Browser } from "puppeteer";
export const listScraper = async (
  browser: Browser,
  page: Page,
  url: string,
  pageI = 1,
  main: string,
  title = "",
  ref = ""
) => {
  // const page = await browser.newPage();

  await page.goto(url, {
    waitUntil: "domcontentloaded",
  });

  const books = await page.evaluate(
    (mainSelector, titleSelector, refSelector) => {
      // const nextLink = document.querySelector('a[rel="next"]');
      // console.log(nextLink);

      const elements = document.querySelectorAll(mainSelector);

      return Array.from(elements)
        .slice(0, 3)
        .map((element) => {
          const title =
            titleSelector.length > 0 &&
            (element.querySelector(titleSelector) as HTMLElement | null)
              ?.innerText;
          const ref =
            refSelector.length > 0 &&
            (element.querySelector(refSelector) as HTMLAnchorElement | null)
              ?.href;

          return { title, ref };
        });
    },
    main,
    title,
    ref
  );

  // await page.click(".pagination > a");

  await browser.close();

  return books;
};
Here is the Goodreads URL I started from:
export const GOODREADS_POPULAR_LISTS_URL = "https://www.goodreads.com/list/popular_lists";
So my question is: how can I optimize my approach, and what techniques can I use to make the scraping faster and thus drastically improve the performance of my API?
I searched various posts and many suggest some kind of CPU manipulation, but I didn't understand how that could be used in my case. Child processes in Node.js were also suggested quite a few times.
Thank you in advance!
2 Answers
There are a few things that can be looked at to improve the speed of this.

1. You're launching a new puppeteer instance on each request. We can create a module following the singleton pattern, then in your code attempt to get the instance (a sketch of such a module follows below).
2. Use Promise.all to wait for all requests to finish in a concurrent manner (see the second sketch below).

There are a lot more optimisations you can make. Consider researching how to find the slow points within your application and do some research on how to optimise the process.
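A minimal sketch of that singleton module, assuming a new file such as browser.ts (the file name and the getBrowser/closeBrowser helpers are illustrative, not part of the original code):

// browser.ts: share one Puppeteer instance across all requests
import puppeteer, { Browser } from "puppeteer";

let browserPromise: Promise<Browser> | null = null;

export const getBrowser = (): Promise<Browser> => {
  // Launch at most once; concurrent callers reuse the same pending promise.
  if (!browserPromise) {
    browserPromise = puppeteer.launch();
  }
  return browserPromise;
};

export const closeBrowser = async (): Promise<void> => {
  // Call this on server shutdown, not after every request.
  if (browserPromise) {
    const browser = await browserPromise;
    browserPromise = null;
    await browser.close();
  }
};

Storing the pending promise rather than the browser itself means two requests arriving at the same time won't each launch their own Chromium process.

And a sketch of the Promise.all idea for when more than one URL needs to be scraped, reusing the shared browser (this assumes listScraper is changed to stop calling browser.close() itself):

// Scrape several URLs concurrently instead of sequentially, e.g. from the handler.
import { getBrowser } from "./browser.js";
import { listScraper } from "./spec-scrapers/list-scraper.js";
import { GOODREADS_POPULAR_LISTS_URL } from "../utils/goodreads/urls.js";

const urls = [GOODREADS_POPULAR_LISTS_URL /* more list URLs later */];

const browser = await getBrowser();
const results = await Promise.all(
  urls.map(async (url) => {
    const page = await browser.newPage();
    try {
      return await listScraper(browser, page, url, 1, ".cell", ".listTitle", ".listTitle");
    } finally {
      await page.close(); // close the tab, keep the shared browser alive
    }
  })
);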
Your approach seems like overkill to me. The data you want is in the static HTML, so a single request to the page looks like enough to get the data, using Node’s built-in fetch API and a simple static HTML parser, Cheerio:
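Here's a sketch of that approach, assuming Node 18+ (for the global fetch) and the same .cell / .listTitle selectors used in the question; the scrapePopularLists name is just for illustration:

// One HTTP request + static HTML parsing, no browser involved
import * as cheerio from "cheerio";

const GOODREADS_POPULAR_LISTS_URL = "https://www.goodreads.com/list/popular_lists";

export const scrapePopularLists = async () => {
  const res = await fetch(GOODREADS_POPULAR_LISTS_URL); // built-in fetch, Node 18+
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const html = await res.text();
  const $ = cheerio.load(html);

  return $(".cell")
    .map((_, cell) => {
      const link = $(cell).find(".listTitle").first();
      return {
        title: link.text().trim(),
        // hrefs in the raw HTML may be relative, so resolve them against the page URL
        ref: new URL(link.attr("href") ?? "", GOODREADS_POPULAR_LISTS_URL).toString(),
      };
    })
    .get();
};

Resolving the href against the page URL is there to mimic the absolute .href that Puppeteer returns from the live DOM.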
For sake of comparison, here’s an optimized Puppeteer version:
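This is a sketch of what such an optimized version could look like, not necessarily the exact script behind the timings below: it reuses the tab that launch() already opens, blocks images, stylesheets and fonts via request interception, waits only for "domcontentloaded", and extracts everything in a single $$eval (selectors again taken from the question):

import puppeteer from "puppeteer";

const GOODREADS_POPULAR_LISTS_URL = "https://www.goodreads.com/list/popular_lists";

export const scrapePopularLists = async () => {
  const browser = await puppeteer.launch();
  try {
    const [page] = await browser.pages(); // reuse the tab launch() already opened

    // Abort requests for resources the scrape doesn't need.
    await page.setRequestInterception(true);
    page.on("request", (req) => {
      if (["image", "stylesheet", "font", "media"].includes(req.resourceType())) {
        req.abort();
      } else {
        req.continue();
      }
    });

    await page.goto(GOODREADS_POPULAR_LISTS_URL, { waitUntil: "domcontentloaded" });

    return await page.$$eval(".cell", (cells) =>
      cells.map((cell) => {
        const link = cell.querySelector<HTMLAnchorElement>(".listTitle");
        return { title: link?.innerText, ref: link?.href };
      })
    );
  } finally {
    await browser.close();
  }
};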
Here's (basically) your version, unoptimized other than "domcontentloaded", which is a good start:
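For reference, here is a condensed, standalone version of the code from the question (same user agent, selectors and "domcontentloaded" wait), stripped of the Express wiring so it can be timed on its own:

import puppeteer from "puppeteer";

const GOODREADS_POPULAR_LISTS_URL = "https://www.goodreads.com/list/popular_lists";

const scrapePopularLists = async () => {
  const browser = await puppeteer.launch();
  try {
    const [page] = await browser.pages();
    await page.setUserAgent(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    );
    await page.goto(GOODREADS_POPULAR_LISTS_URL, { waitUntil: "domcontentloaded" });

    return await page.$$eval(".cell", (cells) =>
      cells.slice(0, 3).map((cell) => {
        const link = cell.querySelector<HTMLAnchorElement>(".listTitle");
        return { title: link?.innerText, ref: link?.href };
      })
    );
  } finally {
    await browser.close();
  }
};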
Unoptimized Puppeteer:
Optimized Puppeteer:
Fetch + Cheerio:
Using fetch and Cheerio is over a 2x speedup, with simpler code and no dependency on a complex library like Puppeteer. The main reasons to use Puppeteer for scraping are when you need to perform complex interactions like clicks, intercept requests, work with cookies, avoid blocks, wait for SPAs to load, etc (some of those cases may apply here, but first add a user agent to the fetch request if you’re getting blocked).
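For example, a user agent can be attached to the fetch call like this (reusing the UA string the question already passes to Puppeteer):

const res = await fetch("https://www.goodreads.com/list/popular_lists", {
  headers: {
    // Same browser-like user agent as in the question's handler
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
  },
});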
Regardless of the scraping approach you take, caching the response is a good idea. I doubt the data changes that much, so you can re-fetch only every hour or so, possibly in a background job, and serve it instantly to everyone from the cache via your proxy.
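A minimal in-memory cache sketch along those lines (CACHE_TTL_MS, cached, getPopularLists and the module path are hypothetical names; any of the scraping functions above could sit behind it):

// Serve cached data instantly; re-scrape at most once per hour.
import { scrapePopularLists } from "./scraper.js"; // hypothetical module path

const CACHE_TTL_MS = 60 * 60 * 1000; // one hour

let cached: { data: unknown; fetchedAt: number } | null = null;

export const getPopularLists = async () => {
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
    return cached.data; // hit: no scraping, answered from memory
  }
  const data = await scrapePopularLists(); // miss: scrape and refresh the cache
  cached = { data, fetchedAt: Date.now() };
  return data;
};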
See this blog post of mine for details on the techniques I used to speed this script up.