
I am trying to scrape the very first name in a table on a website that presents a basketball team and that team's players' names and statistics. When I do, the navigation timeout is exceeded (meaning the value was not scraped in the given time), and on the client side "Error Loading Data" appears. What am I doing wrong?

FYI – the code contains various debugging statements that are not essential to its functioning.

Here is my JavaScript code:

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();
app.use(express.static("public"));

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    console.log('Attempting to scrape data...');
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    // Increase the timeout to 60 seconds
    await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });

    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });

    const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());

    console.log('Scraping successful:', firstPlayerName);

    res.json({ firstPlayerName });
  } catch (err) {
    console.error('Error during scraping:', err);
    res.status(500).json({ error: 'Internal Server Error' });
  } finally {
    await browser?.close();
  }
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});

Here is my HTML code:

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <table>
    <p class="robo-header">Robo-Scout </p>
    <p class="robo-subheader"><br> Official Algorithmic Bakstball Scout</p>
    <tr>
      <td>
        <p id="myObjValue"> Loading... </p>
        <script>
          fetch('/scrape') // Send a GET request to the server
            .then(response => {
              if (!response.ok) {
                throw new Error('Network response was not ok');
              }
              return response.json();
            })
            .then(data => {
              console.log(data); // Check what data is received
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = data.firstPlayerName || 'Player name not found';
            })
            .catch(error => {
              console.error(error);
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = 'Error loading data'; // Display an error message
            });
        </script>
      </td>
    </tr>
  </table>
</body>
</html>

Here is the code from the cell of the table I’m trying to scrape:

                                    <td class="text-left">

    <a href="/player/maddie-bulbulia/girlsbasketball/season/2022-2023">Maddie Bulbulia</a> <small class="text-muted">Sophomore • G</small>
</td>

I have tried debugging the code to trace why the value isn't being pulled, logging when the value is missing along with the error. I have also tried increasing the navigation timeout to 60 seconds rather than 30 in case my network was moving slowly; no change.

2 Answers


  1. This code looks problematic:

    await page.goto(url, { timeout: 60000 });
    
    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });
    

    page.goto() already waits for navigation, so waiting for yet another navigation with page.waitForNavigation() causes a timeout.

    This is such a common mistake that I have a section about it in a blog post on typical Puppeteer mistakes. The solution is to remove the unnecessary page.waitForNavigation() line.

    Secondly, use page.goto(url, {waitUntil: "domcontentloaded"}) rather than the default "load" event. Some anti-scraping approaches (or poorly-coded pages) seem to defer the load event, causing navigation timeouts. "domcontentloaded" is the fastest option and almost always preferred.
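    Putting those two fixes together, the navigation step reduces to a single call (here url stands in for the stats page URL):

    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 60000 });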

    Going a step further, since the data is baked into the static HTML, you can block all resource requests and disable JS. Here’s an optimized script:

    const puppeteer = require("puppeteer"); // ^21.4.1
    
    const url = "<Your URL>";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: "new"});
      const [page] = await browser.pages();
      // The data is baked into the static HTML, so skip script execution entirely
      await page.setJavaScriptEnabled(false);
      // Block every request except the initial document
      await page.setRequestInterception(true);
      page.on("request", req =>
        req.url() === url ? req.continue() : req.abort()
      );
      await page.goto(url, {waitUntil: "domcontentloaded"});
      const firstPlayerName = await page.$eval(
        "td.text-left a",
        player => player.textContent.trim()
      );
      console.log("Scraping successful:", firstPlayerName);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Going yet another step further, you may not even need Puppeteer. You can make a request with fetch, native in Node 18+, and parse the data you want from the response with a lightweight library like Cheerio.

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const url = "<Your URL>";
    
    fetch(url)
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
        const firstPlayerName = $("td.text-left a").first().text();
        console.log(firstPlayerName); // => Maddie Bulbulia
      })
      .catch(err => console.error(err));
    

    Here are some quick benchmarks.

    Unoptimized Puppeteer (only using "domcontentloaded"):

    real 0m2.974s
    user 0m1.004s
    sys  0m0.271s
    

    Optimized Puppeteer (using DCL, plus disabling JS and blocking resources):

    real 0m1.190s
    user 0m0.510s
    sys  0m0.114s
    

    Fetch/Cheerio:

    real 0m0.998s
    user 0m0.261s
    sys  0m0.049s
    

    If the scraped data doesn’t change often, you might consider caching the results of the scrape periodically so you can serve it up to your users instantly and more reliably.
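
    Here is a minimal sketch of that idea, assuming a simple in-memory cache; CACHE_TTL_MS and scrapeFirstPlayerName are hypothetical names standing in for your own refresh interval and scraping logic:

    const CACHE_TTL_MS = 10 * 60 * 1000; // hypothetical: refresh at most every 10 minutes
    let cached = null;
    let cachedAt = 0;
    
    app.get("/scrape", async (req, res) => {
      try {
        // Serve the cached value while it is still fresh
        if (cached && Date.now() - cachedAt < CACHE_TTL_MS) {
          return res.json({ firstPlayerName: cached });
        }
        // Otherwise scrape again and refresh the cache
        cached = await scrapeFirstPlayerName(); // your Puppeteer or Cheerio logic
        cachedAt = Date.now();
        res.json({ firstPlayerName: cached });
      } catch (err) {
        console.error("Error during scraping:", err);
        res.status(500).json({ error: "Internal Server Error" });
      }
    });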

  2. There are several changes you might want to try:

    1. You’re using both await page.goto and await page.waitForNavigation, which may lead to a conflict. You might want to remove the await page.waitForNavigation line:

       await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });
      
    2. If the page takes longer to load due to a slow network or other factors, you may want to increase the timeout for the element lookup. Note that $eval itself does not accept a timeout option, so pass the timeout to page.waitForSelector instead:

       await page.waitForSelector('tbody tr:first-child .text-left a', { timeout: 60000 });
      
    3. Before using $eval, you may want to ensure that the element you are trying to select actually exists on the page. Use page.waitForSelector to wait for the element to appear.

       await page.waitForSelector('tbody tr:first-child .text-left a');
       const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());
      
    4. If the website is slow or experiencing network issues, increasing the timeout might not be enough. You can try adding a short wait after navigation to give late-loading content time to render:

      await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });
      await page.waitForTimeout(5000);
      
    5. Additionally, you might want to check the console output in the Puppeteer browser to see if there are any errors or messages that could provide more insight into the problem. You can disable headless mode ({ headless: false }) when launching Puppeteer to visually inspect what is happening on the page during the scraping process.
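
       For example, a minimal debugging setup along those lines might look like this (the console listener simply forwards the page's own log messages to your terminal):

       const browser = await puppeteer.launch({ headless: false });
       const [page] = await browser.pages();
       // Forward the page's console messages to the Node terminal
       page.on('console', msg => console.log('PAGE LOG:', msg.text()));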
