
I’m building a web scraper for a school web page. I’m using Puppeteer to scrape a client-side rendered HTML page, but I have run into a problem.

My code is:

const puppeteer = require("puppeteer");

async function scrapeData(url) {
  console.log("Target URL: ", url);

  const browser = await puppeteer.launch({ headless: "new" });

  try {
    const page = await browser.newPage();

    await page.goto(url);

    // wait for client-side loading
    await page.waitForSelector(".tit");

    // get texts from html. ignore this code.
    const titles = await page.$$eval(".tit a", (elements) => {
      return elements.map((element) => element.textContent);
    });

    console.log("before click");

    // Click the element that has the ".tit" class.
    // That element has an onclick event listener (checked manually in Chrome).
    // However, this line throws a timeout exception from `page.waitForNavigation()`.
    await Promise.all([page.waitForNavigation(), page.click(".tit")]);

    console.log("navigation success.");

    const newUrl = page.url();

    const result = {
      titles,
      newUrl,
    };

    return result;
  } finally {
    await browser.close();
  }
}

const targetUrl = "https://kau.ac.kr/web/pages/gc32172b.do";
scrapeData(targetUrl)
  .then((result) => {
    console.log("Scraped Titles:", result.titles);
    console.log("New URL after click:", result.newUrl);
  })
  .catch((error) => console.error("Error during scraping:", error));

Summary of my code:

  1. Puppeteer opens a browser and goes to "https://kau.ac.kr/web/pages/gc32172b.do".
  2. It waits for rendering, then clicks the element with the '.tit' class.
  3. When the '.tit' element is clicked, the browser navigates to a new URL. (Clicking is the only option, because the page navigates to the new URL dynamically.)
  4. After navigation, it reads the new URL and returns it.

However, the line await Promise.all([page.waitForNavigation(), page.click(".tit")]); throws a timeout exception.

What I tried:

  1. In Chrome, I ran this code in the DevTools console:
const title = document.querySelector(".tit");
title.click();
// I confirmed that this code makes the browser navigate
  2. I replaced waitForNavigation with a manual timeout via Puppeteer’s API. However, I still could not get the new URL.

Does this mean that page.click() causes a new page to be created and navigated to the new URL?

2 Answers


  1. <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Home</title>
    </head>
    <body>
        <div>
            <p>Home</p>
            <a href="https://www.example.com">Link</a>
            <a href="https://www.exampleqwewrtqry.com">Link</a>
        </div>
        <script>
            // Log the first link's href whenever the mouse hovers over it.
            var link = document.querySelector('a');
            link.addEventListener('mouseover', function() {
                console.log(this.href);
            });
        </script>
    </body>

    </html>
    

    This is how it can be implemented in pure JS.

  2. Try using an untrusted click, as described in my blog post:

    await Promise.all([
      page.waitForNavigation(),
      page.$eval(".tit a", el => el.click()),
    ]);
    

    Trusted clicks with page.click() are complex: Puppeteer scrolls the element into view and clicks its center with the mouse, so the element must be visible, and the click often fails to land properly because of how certain pages behave.
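
    To check whether visibility is what trips up page.click() on a given page, a small diagnostic along these lines may help. This is only a sketch: describeClickTarget is a hypothetical helper name, and a null result from boundingBox() only tells you the element is not part of the layout (for example display: none). It does not catch every case, such as an element covered by another one.

    ```javascript
    // Hypothetical helper: reports whether `selector` matches an element and
    // whether that element currently has a layout box.
    async function describeClickTarget(page, selector) {
      const handle = await page.$(selector); // null when nothing matches
      if (!handle) {
        return { found: false, hasBox: false };
      }
      const box = await handle.boundingBox(); // null when not part of the layout
      return { found: true, hasBox: box !== null };
    }
    ```

    Calling console.log(await describeClickTarget(page, ".tit")) just before the click shows whether the selector matched and whether the match has a box to click on.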

    Another trick, if this didn’t work, would be to grab the href from the link and use page.goto() to navigate directly to it. In most cases, when automating in a scraping context, there’s no obligation to mimic user interaction as there is in tests. There’s usually a way to bypass fussy clicking. In this case, though, there doesn’t seem to be an href present on the link, but the strategy might come in handy elsewhere.
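
    For pages where the link does carry an href, the fallback could look roughly like this. navigateViaHref is a hypothetical helper name; the sketch assumes the selector matches an anchor element, and it returns null when there is no usable href (as seems to be the case on this page).

    ```javascript
    // Hypothetical helper: read the link's href and navigate to it directly,
    // skipping the click altogether.
    async function navigateViaHref(page, selector) {
      // page.$eval throws if nothing matches the selector.
      // getAttribute returns null when the attribute is missing.
      const href = await page.$eval(selector, (el) => el.getAttribute("href"));
      if (!href || href.startsWith("#") || href.toLowerCase().startsWith("javascript:")) {
        return null; // nothing to navigate to
      }
      // Resolve relative hrefs against the current page URL.
      const target = new URL(href, page.url()).href;
      await page.goto(target);
      return page.url();
    }
    ```

    A null result means the click-based approach above is still needed for that link.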
