
I am scraping a dynamic website with Puppeteer. My goal is to make the scraping logic as generic as possible and cut down on boilerplate code, so I wrote an external function that scrapes the data given certain parameters. The problem is that when I call that function inside Puppeteer's page.evaluate() method, I get a ReferenceError saying the function is not defined.

I did some research, and page.exposeFunction() and page.addScriptTag() came up as possible solutions. However, when I tried them in my scraper, addScriptTag() didn't work, and exposeFunction() didn't let me access DOM elements inside the exposed function. I understand that a function registered with exposeFunction() executes in Node.js, while a script added with addScriptTag() executes in the browser, but I don't know how to proceed with that information, or whether it is even relevant to my case.

Here is my scraper:

import { Browser } from "puppeteer";

import { dataMapper } from "../../utils/api/functions/data-mapper.js";

export const mainCategoryScraper = async (browser: Browser) => {
  const [page] = await browser.pages();

  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  );

  await page.setRequestInterception(true);

  page.on("request", (req) => {
    if (
      req.resourceType() === "stylesheet" ||
      req.resourceType() === "font" ||
      req.resourceType() === "image"
    ) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto("https://www.ozone.bg/pazeli-2d-3d/nastolni-igri", {
    waitUntil: "domcontentloaded",
  });

  /**
   * Function will execute in Node.js
   */
  // await page.exposeFunction('dataMapper', dataMapper);

  /**
   * The way of passing DOM elements to the function, because like that the function executes in the browser
   */
  // await page.addScriptTag({ content: `${dataMapper}` });

  const data = await page.evaluate(async () => {
    const contentContainer = document.querySelector(".col-main") as HTMLDivElement;

    const carousels = Array.from(
      contentContainer.querySelectorAll(".owl-item") as NodeListOf<HTMLDivElement>
    );

    const carouselsData = await dataMapper<HTMLDivElement>(carousels, ".title", "img", "a");

    return {
      carouselsData,
    };
  });
  await browser.close();

  return data;
};

And here is the dataMapper function:

import { PossibleTags } from "../typescript/types.js";

export const dataMapper = function <T extends HTMLDivElement>(items: Array<T>, ...selectors: string[]) {
  let hasTitle = false;

  for (const selector of selectors) {
    if (selector === ".title" || selector === "h3") {
      hasTitle = true;
      break;
    }
  }
  
  return items.map((item) => {
    const data: PossibleTags = {};

    selectors.forEach((selector) => {
      const dataProp = item.querySelector(selector);

      switch (selector) {
        case ".title": {
          data["title"] = (dataProp as HTMLSpanElement)?.innerText;
          break;
        }
        case "h3": {
          data["title"] = (dataProp as HTMLHeadingElement)?.innerText;
          break;
        }
        case "h6": {
          data["subTitle"] = (dataProp as HTMLHeadingElement)?.innerText;
          break;
        }
        case "img": {
          if (!hasTitle) {
            data["img"] = (dataProp as HTMLImageElement)?.getAttribute("src") ?? undefined;
            break;
          }

          data["title"] = (dataProp as HTMLImageElement)?.getAttribute("alt") ?? undefined;
          break;
        }
        case "a": {
          data["url"] = (dataProp as HTMLAnchorElement)?.getAttribute("href") ?? undefined;
          break; // prevent fall-through into the default case, which throws
        }
        default: {
          throw new Error("Such selector is not yet added to the possible selectors");
        }
      }
    });

    // One object per item, holding whichever fields the selectors produced
    return data;
  });
};

When I use page.exposeFunction('dataMapper', dataMapper), I get an error that item.querySelector is not a function (inside dataMapper). With await page.addScriptTag({ content: `${dataMapper}` }), page.evaluate later throws an error that dataMapper is not a function.

Update: when I specify a path in addScriptTag instead, I still get: Error [ReferenceError]: dataMapper is not defined

Note that mainCategoryScraper is later used in a scrapersHandler function, which decides which scraper to execute based on the URL endpoint.

2 Answers


  1. Chosen as BEST ANSWER

    Here are the scrapers as separate functions, which is why I am aiming to create one general function:

    export const carouselsMapper = (items: HTMLDivElement[]) => {
      const carouselsData = items.map((item) => {
        const title = (item.querySelector(".title") as HTMLSpanElement)?.innerText;
        const img = (item.querySelector("img") as HTMLImageElement)?.getAttribute("src");
        const url = (item.querySelector("a") as HTMLAnchorElement)?.getAttribute("href");
    
        return {
          title,
          img,
          url,
        };
      });
    
      return carouselsData;
    };
    
    export const sliderMapper = (items: HTMLDivElement[]) => {
      const sliderData = items.map((item) => {
        const title = (item.querySelector("h3") as HTMLHeadingElement)?.innerText;
        const subTitle = (item.querySelector("h6") as HTMLHeadingElement)?.innerText;
        const img = (item.querySelector("img") as HTMLImageElement)?.getAttribute("src");
        const url = (item.querySelector("a") as HTMLAnchorElement)?.getAttribute("href");
    
        return {
          title,
          subTitle,
          img,
          url,
        };
      });
    
      return sliderData;
    };
    
    export const widgetsMapper = (items: HTMLDivElement[]) => {
      const sliderData = items.map((item) => {
        const title = (item.querySelector("img") as HTMLImageElement)?.getAttribute("alt");
        const img = (item.querySelector("img") as HTMLImageElement)?.getAttribute("src");
        const url = item.querySelector("a")?.getAttribute("href") as string;
    
        return {
          title,
          img,
          url,
        };
      });
    
      return sliderData;
    };
    

  2. As discussed in my comment, the approach here seems rather convoluted. I’d caution against premature abstractions.

    In general, once you need to add multiple conditions (switch and if) where there weren’t any before, you may be headed down the wrong path. These increase the cognitive complexity of the code. Complexity in a function can be acceptable if it reduces complexity for the caller, but if the contract for the function isn’t clear, then the abstraction may leak problems back to the caller.

    Packing all of the logic into your dataMapper function breaches the single responsibility principle and makes it unmaintainable, because you’ll need to keep burdening it further with additional types of structures. The control flow within the function is already difficult to grasp and can’t be extended in any sensible way. The caller should be responsible for explicitly encoding the structure to be scraped, rather than trying to write an all-in-one function that can’t sensibly be written for these structures.

    Another rule of thumb: if the factoring is difficult, then just keep the repetition. Or take a step back and try to write a different abstraction, either at a higher or lower level than the first attempt.

    In this case, you might write a couple of higher-level abstractions $$evalMap and $text, which let you write your data mappers more cleanly. These abstractions just clear some of the syntax out of the way, but don’t attempt to generalize scraping different structures with conditions.

    const puppeteer = require("puppeteer"); // ^22.7.1
    
    const url = "https://www.ozone.bg/pazeli-2d-3d/nastolni-igri";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const ua =
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
      await page.setUserAgent(ua);
      await page.setRequestInterception(true);
      const blockedResources = [
        "image",
        "fetch",
        "other",
        "ping",
        "stylesheet",
        "xhr",
      ];
      page.on("request", req => {
        if (
          !req.url().startsWith("https://www.ozone.bg") ||
          blockedResources.includes(req.resourceType())
        ) {
          req.abort();
        } else {
          req.continue();
        }
      });
      await page.goto(url, {waitUntil: "domcontentloaded"});
    
      await page.evaluate(
        "window.$text = (el, s) => el.querySelector(s)?.textContent.trim();"
      );
    
      const $$evalMap = async (sel, mapFn) => {
        await page.waitForSelector(sel);
        return page.$$eval(
          sel,
          (els, mapFn) => els.map(new Function(`return ${mapFn}`)()),
          mapFn.toString()
        );
      };
    
      const carouselData = await $$evalMap(".owl-item", el => ({
        title: $text(el, ".title"),
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
      const widgetData = await $$evalMap(".widget-box", el => ({
        title: el.querySelector("img").alt,
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
      const sliderData = await $$evalMap(
        ".item.slick-slide",
        el => ({
          title: $text(el, "h3"),
          subTitle: $text(el, "h6"),
          img: el.querySelector("img").src,
          url: el.querySelector("a").href,
        })
      );
    
      console.log(carouselData);
      console.log(widgetData);
      console.log(sliderData);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
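
To see why $$evalMap stringifies mapFn, and why the dataMapper call in the question raised a ReferenceError, here is a minimal sketch that runs in plain Node without Puppeteer. As far as I know, Puppeteer sends a function passed to evaluate or $$eval to the browser as source text, so anything the function closes over on the Node side does not travel with it. The names rebuild, twice and makeScaler are made up for illustration, and new Function only stands in for the browser re-parsing that source.

```typescript
// Hypothetical helper: re-create a function from its source text, the same
// trick $$evalMap uses inside the page with new Function(`return ${mapFn}`)().
function rebuild<A, R>(source: string): (arg: A) => R {
  return new Function(`return ${source}`)() as (arg: A) => R;
}

// Self-contained: depends only on its argument, so it survives the round trip.
const twice = (n: number): number => n * 2;

// Returns a function that closes over a local variable. The variable exists
// only in this scope and is not part of the function's source text.
function makeScaler(): (n: number) => number {
  const scaleFactor = 3;
  return (n) => n * scaleFactor;
}

const twiceRemotely = rebuild<number, number>(twice.toString());
console.log(twiceRemotely(21)); // 42

// The rebuilt copy has lost its closure, much like dataMapper inside evaluate.
const scaleRemotely = rebuild<number, number>(makeScaler().toString());
let failure = "";
try {
  scaleRemotely(21);
} catch (error) {
  failure = error instanceof Error ? error.name : String(error);
}
console.log(failure); // ReferenceError
```

This is why the mapper callbacks above only touch their el argument and $text, which was defined in the page itself beforehand.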
    

    That said, there’s not that much harm in doing the work directly in $$evals:

    // ... same as above ...
    await page.goto(url, {waitUntil: "domcontentloaded"});
    
    await page.evaluate(
      "window.$text = (el, s) => el.querySelector(s)?.textContent.trim();"
    );
    
    await page.waitForSelector(".owl-item");
    const carouselData = await page.$$eval(".owl-item", els =>
      els.map(el => ({
        title: $text(el, ".title"),
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }))
    );
    
    const widgetData = await page.$$eval(".widget-box", els =>
      els.map(el => ({
        title: el.querySelector("img").alt,
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }))
    );
    
    const sliderData = await page.$$eval(
      ".item.slick-slide",
      els =>
        els.map(el => ({
          title: $text(el, "h3"),
          subTitle: $text(el, "h6"),
          img: el.querySelector("img").src,
          url: el.querySelector("a").href,
        }))
    );
    // ...
    

    If TypeScript types are getting in the way, consider moving the querySelectors out to similar helper functions as was done with $text. The $$eval calls can also be moved out to individual functions for each type.
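
As a rough sketch of what such typed helpers could look like: the Queryable and ElementLike interfaces below are stand-ins I made up so the example type-checks and runs without a browser, and text, attr, toCarouselItem and the fake elements are hypothetical names. In a real page you would type against Element instead, and the helpers would have to exist in the page (as $text does) before a $$eval callback could call them.

```typescript
// Minimal structural stand-ins for the parts of the DOM API used below
// (an assumption for illustration; real code would use Element).
interface Queryable {
  querySelector(selector: string): ElementLike | null;
}
interface ElementLike extends Queryable {
  textContent: string | null;
  getAttribute(name: string): string | null;
}

// Trimmed text of the first match, or undefined when nothing matches.
const text = (root: Queryable, selector: string): string | undefined =>
  root.querySelector(selector)?.textContent?.trim();

// Attribute of the first match. getAttribute returns null for a missing
// attribute, so null is normalised to undefined here.
const attr = (root: Queryable, selector: string, name: string): string | undefined =>
  root.querySelector(selector)?.getAttribute(name) ?? undefined;

interface CarouselItem {
  title: string | undefined;
  img: string | undefined;
  url: string | undefined;
}

// The mapper itself needs no type assertions.
const toCarouselItem = (el: Queryable): CarouselItem => ({
  title: text(el, ".title"),
  img: attr(el, "img", "src"),
  url: attr(el, "a", "href"),
});

// Fake elements so the sketch can run in Node.
const fakeChild = (textContent: string, attrs: Record<string, string>): ElementLike => ({
  textContent,
  getAttribute: (name) => attrs[name] ?? null,
  querySelector: () => null,
});

const fakeItem: Queryable = {
  querySelector: (selector) =>
    selector === ".title"
      ? fakeChild("  Catan  ", {})
      : selector === "img"
        ? fakeChild("", { src: "/catan.jpg" })
        : null, // no <a> inside this fake item
};

const item = toCarouselItem(fakeItem);
console.log(item.title, item.img, item.url); // Catan /catan.jpg undefined
```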

    Summary and further remarks:

    • Avoid premature abstractions.
    • When factoring, stop if you’re introducing multiple conditions where there weren’t any before. switch/case/break is particularly nasty. If your abstractions are more verbose and hard to understand than the original repeated code, don’t do them, or try to find a different abstraction.
    • When factoring, write it the verbose way first, then try to abstract away the similarities (but a bit of repetition is acceptable–be honest about what’s easier to read and maintain).
    • Type assertions with as are discouraged in TS. Use them as little as possible in favor of type annotations on variables.
    • Use $$eval all the time. It’s the most generally useful scraping function in Puppeteer, avoiding an ugly Array.from(document.querySelectorAll) or element handles. If Puppeteer’s locators API matures in the future, $$eval may be supplanted, but for now it’s the way to go.
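    • ?? undefined is usually unnecessary. If the optional chaining operator ?. short-circuits, the expression already evaluates to undefined, so undefined ?? undefined is pointless. It only makes a difference when the element exists but getAttribute returns null for a missing attribute.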
    • Generally speaking, you don’t need addScriptTag or exposeFunction. If you’re writing $text-like abstractions often, you can add jQuery or something like that to simplify querying.
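
For context on the last bullet, here is a rough sketch of why both APIs likely misbehaved in the question, runnable in plain Node without Puppeteer. It rests on two assumptions: that arguments passed to an exposed function reach the Node side as serialized data (or handles) rather than live DOM nodes, modelled here with a JSON round trip, and that addScriptTag({ content }) injects the string as an ordinary script. fakeElement, mapperSource and parses are hypothetical names.

```typescript
// 1) exposeFunction: a JSON round trip keeps plain data but drops methods,
// which matches the "item.querySelector is not a function" symptom.
const fakeElement = {
  id: "item-1",
  querySelector: (selector: string) => selector, // stand-in for a DOM method
};
const received = JSON.parse(JSON.stringify(fakeElement)) as Record<string, unknown>;
console.log(typeof received["querySelector"]); // undefined

// 2) addScriptTag({ content: `${dataMapper}` }): interpolating an anonymous
// function expression yields only its source text. On its own that text is
// not a valid statement and defines no global name.
const mapperSource = String(function (items: string[]) {
  return items.map((item) => item.toUpperCase());
});

// Hypothetical helper: true if the source parses as script statements.
// new Function only compiles the body here; it is never called.
const parses = (scriptSource: string): boolean => {
  try {
    new Function(scriptSource);
    return true;
  } catch (error) {
    if (error instanceof Error && error.name === "SyntaxError") return false;
    throw error;
  }
};

const bare = mapperSource;                               // what `${fn}` injects
const assigned = `globalThis.mapper = ${mapperSource};`; // explicit global

console.log(parses(bare));     // false
console.log(parses(assigned)); // true
```

The second half is the same idea as the window.$text = ... string in the answer's code: the page only gets a usable name when the injected source assigns one.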