
I am using Node.js and the Puppeteer library to load a website and check whether a certain text is displayed on the page. I would like to count the number of occurrences of that text, and I want the search to behave the same way Ctrl+F does in Chrome or Firefox.

Here’s the code I have so far:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // How do I count the occurrences of the specific text here?

  await browser.close();
})();

Can someone please help me with a solution on how to achieve this? Any help would be greatly appreciated. Thanks in advance!

3 Answers


  1. You can extract all of the page's text and then run a regex or a simple string search on it.

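    // page.$eval('*', ...) runs the callback on the first matching element, which is <html>.
    const extractedText = await page.$eval('*', (el) => el.innerText);
    console.log(extractedText);
    // Replace '--search word--' with the text to look for; the 'g' flag finds every match.
    const regx = new RegExp('--search word--', 'g');
    const count = (extractedText.match(regx) || []).length;
    console.log(count);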
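
    The "simple search" option can be sketched without any regex. The helper below is hypothetical (the name `countOccurrences` is mine, not part of Puppeteer) and assumes you want case-insensitive, non-overlapping matches, which may or may not be how a given browser counts them:

    ```javascript
    // Hypothetical helper: counts non-overlapping, case-insensitive occurrences
    // of `needle` in `haystack` with indexOf, so no regex escaping is needed.
    function countOccurrences(haystack, needle) {
      if (needle === '') return 0; // an empty needle would otherwise never advance
      const text = haystack.toLowerCase();
      const target = needle.toLowerCase();
      let count = 0;
      let index = text.indexOf(target);
      while (index !== -1) {
        count += 1;
        // Continue searching after the end of the current match.
        index = text.indexOf(target, index + target.length);
      }
      return count;
    }
    ```

    You would call it as `countOccurrences(extractedText, 'some text')` once the page text has been extracted.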
    
  2. import puppeteer from 'puppeteer'
    
    (async () => {
      const textToFind = 'domain'
      const browser = await puppeteer.launch()
      const page = await browser.newPage()
      await page.goto('https://example.com')
    
      const text = await page.evaluate(() => document.documentElement.innerText)
    
      const n = [...text.matchAll(new RegExp(textToFind, 'gi'))].length
      console.log(`${textToFind} appears ${n} times`)
    
      await browser.close()
    })()
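
    One caveat with passing `textToFind` straight to `new RegExp`: characters such as `.` or `(` are treated as regex syntax rather than literal text. A minimal sketch of one way around that, assuming you want a literal, case-insensitive match (`escapeRegExp` and `countMatches` are hypothetical helper names, not Puppeteer APIs):

    ```javascript
    // Hypothetical helper: escapes regex metacharacters so the search
    // string is matched literally.
    function escapeRegExp(literal) {
      return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    }

    // Hypothetical helper: counts literal, case-insensitive matches of
    // `textToFind` in `text`.
    function countMatches(text, textToFind) {
      if (textToFind === '') return 0; // an empty pattern would match at every position
      const pattern = new RegExp(escapeRegExp(textToFind), 'gi');
      return [...text.matchAll(pattern)].length;
    }
    ```

    With this in place, the count in the snippet above could be computed as `countMatches(text, textToFind)`.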
    
  3. As I mentioned in a comment, the Ctrl+F algorithm may not be as simple as you presume, but you may be able to approximate it by collecting the text contents and input values of all visible elements, excluding style, script and metadata tags.

    Here’s a simple proof of concept:

    const puppeteer = require("puppeteer"); // ^19.7.2
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const ua =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
      await page.setUserAgent(ua);
      const url = "https://www.google.com";
      await page.goto(url, {waitUntil: "domcontentloaded"});
      await page.evaluate(() =>
        window.isVisible = e =>
          // https://stackoverflow.com/a/21696585/6243352
          e.offsetParent !== null &&
          getComputedStyle(e).visibility !== "hidden" &&
          getComputedStyle(e).display !== "none"
      );
      const excludedTags = [
        "head",
        "link",
        "meta",
        "script",
        "style",
        "title",
      ];
      const text = await page.$$eval(
        "*",
        (els, excludedTags) =>
          els
            .filter(e =>
              !excludedTags.includes(e.tagName.toLowerCase()) &&
              isVisible(e)
            )
            .flatMap(e => [...e.childNodes])
            .filter(e => e.nodeType === Node.TEXT_NODE)
            .map(e => e.textContent.trim())
            .filter(Boolean),
        excludedTags
      );
      const values = await page.$$eval("[value]", els =>
        els
          .filter(isVisible)
          .map(e => e.value.trim())
          .filter(Boolean)
      );
      const visible = [
        ...new Set([...text, ...values].map(e => e.toLowerCase())),
      ];
      console.log(visible);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Output:

    [
      'about',
      'store',
      'gmail',
      'images',
      'sign in',
      'businesses and job seekers',
      'in your community',
      'are growing with help from google',
      'advertising',
      'business',
      'how search works',
      'carbon neutral since 2007',
      'privacy',
      'terms',
      'settings',
      'google search',
      "i'm feeling lucky"
    ]
    

    Undoubtedly, this has some false positives and negatives, and I’ve only tested it on google.com. Feel free to post a counterexample and I’ll see if I can toss it in.

    Also, since we run two separate queries, then combine the results and dedupe, ordering of the text isn’t the same as it appears on the page. You could query by *, [value] and use conditions to figure out which you’re working with if this matters. I’ve assumed your final goal is just a true/false "does some text exist?" semantic.
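
    For that true/false check, a minimal sketch over the `visible` array from the proof of concept might look like this. `textExists` is a hypothetical helper name, and the sample data is a shortened copy of the output shown earlier:

    ```javascript
    // Hypothetical helper: `visible` is an array of lowercased strings like
    // the one logged by the proof of concept. Returns true if any entry
    // contains the query, ignoring case and surrounding whitespace.
    function textExists(visible, query) {
      const needle = query.trim().toLowerCase();
      if (needle === '') return false;
      return visible.some(entry => entry.includes(needle));
    }

    const sample = ['about', 'store', 'google search', "i'm feeling lucky"];
    console.log(textExists(sample, 'Feeling Lucky')); // true
    console.log(textExists(sample, 'maps')); // false
    ```

    Each entry comes from a single text node or input value, so a phrase that spans two nodes won't be found, and because the list is deduped it suits an existence check better than an occurrence count.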
