
I am trying to scrape the very first name in a table on a website that presents a basketball team and that team's players' names and statistics. When I do, the navigation timeout is exceeded (meaning the value was not scraped in the given time), and on the client side "Error Loading Data" appears. What am I doing wrong?

FYI – the code contains various debugging statements that are not essential to its functioning.

Here is my JavaScript code:

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();
app.use(express.static("public"));

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    console.log('Attempting to scrape data...');
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    // Increase the timeout to 60 seconds
    await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });

    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });

    const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());

    console.log('Scraping successful:', firstPlayerName);

    res.json({ firstPlayerName });
  } catch (err) {
    console.error('Error during scraping:', err);
    res.status(500).json({ error: 'Internal Server Error' });
  } finally {
    await browser?.close();
  }
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});

Here is my HTML code:

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <table>
    <p class="robo-header">Robo-Scout </p>
    <p class="robo-subheader"><br> Official Algorithmic Bakstball Scout</p>
    <tr>
      <td>
        <p id="myObjValue"> Loading... </p>
        <script>
          fetch('/scrape') // Send a GET request to the server
            .then(response => {
              if (!response.ok) {
                throw new Error('Network response was not ok');
              }
              return response.json();
            })
            .then(data => {
              console.log(data); // Check what data is received
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = data.firstPlayerName || 'Player name not found';
            })
            .catch(error => {
              console.error(error);
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = 'Error loading data'; // Display an error message
            });
        </script>
      </td>
    </tr>
  </table>
</body>
</html>

Here is the code from the cell of the table I’m trying to scrape:

                                    <td class="text-left">

    <a href="/player/maddie-bulbulia/girlsbasketball/season/2022-2023">Maddie Bulbulia</a> <small class="text-muted">Sophomore • G</small>
</td>

I have tried debugging the code to trace why the value isn't being pulled, logging when the value is missing along with the error. I have also tried increasing the navigation timeout to 60 seconds rather than 30 in case my network was moving slowly; no change.

2 Answers


  1. This code looks problematic:

    await page.goto(url, { timeout: 60000 });
    
    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });
    

    page.goto() already waits for navigation, so waiting for yet another navigation with page.waitForNavigation() causes a timeout.

    This is such a common mistake that I have a section about it in a blog post on typical Puppeteer mistakes. The solution is to remove the unnecessary page.waitForNavigation() line.

    Secondly, use page.goto(url, {waitUntil: "domcontentloaded"}) rather than the default "load" event. Some anti-scraping approaches (or poorly-coded pages) seem to defer the load event, causing navigation timeouts. "domcontentloaded" is the fastest option and almost always preferred.
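    Putting those two fixes together, the navigation step reduces to a single call (here url stands in for the stats page URL):

    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 60000 });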

    Going a step further, since the data is baked into the static HTML, you can block all resource requests and disable JS. Here’s an optimized script:

    const puppeteer = require("puppeteer"); // ^21.4.1
    
    const url = "<Your URL>";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: "new"});
      const [page] = await browser.pages();
      // The data is baked into the static HTML, so skip script execution entirely
      await page.setJavaScriptEnabled(false);
      // Block every request except the initial document
      await page.setRequestInterception(true);
      page.on("request", req =>
        req.url() === url ? req.continue() : req.abort()
      );
      await page.goto(url, {waitUntil: "domcontentloaded"});
      const firstPlayerName = await page.$eval(
        "td.text-left a",
        player => player.textContent.trim()
      );
      console.log("Scraping successful:", firstPlayerName);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Going yet another step further, you may not even need Puppeteer. You can make a request with fetch, native in Node 18+, and parse the data you want from the response with a lightweight library like Cheerio.

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const url = "<Your URL>";
    
    fetch(url)
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
        const firstPlayerName = $("td.text-left a").first().text();
        console.log(firstPlayerName); // => Maddie Bulbulia
      })
      .catch(err => console.error(err));
    

    Here are some quick benchmarks.

    Unoptimized Puppeteer (only using "domcontentloaded"):

    real 0m2.974s
    user 0m1.004s
    sys  0m0.271s
    

    Optimized Puppeteer (using DCL, plus disabling JS and blocking resources):

    real 0m1.190s
    user 0m0.510s
    sys  0m0.114s
    

    Fetch/Cheerio:

    real 0m0.998s
    user 0m0.261s
    sys  0m0.049s
    

    If the scraped data doesn’t change often, you might consider caching the results of the scrape periodically so you can serve it up to your users instantly and more reliably.
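
    Here is a minimal sketch of that idea, assuming a simple in-memory cache; CACHE_TTL_MS and scrapeFirstPlayerName are hypothetical names standing in for your own refresh interval and scraping logic:

    const CACHE_TTL_MS = 10 * 60 * 1000; // hypothetical: refresh at most every 10 minutes
    let cached = null;
    let cachedAt = 0;
    
    app.get("/scrape", async (req, res) => {
      try {
        // Serve the cached value while it is still fresh
        if (cached && Date.now() - cachedAt < CACHE_TTL_MS) {
          return res.json({ firstPlayerName: cached });
        }
        // Otherwise scrape again and refresh the cache
        cached = await scrapeFirstPlayerName(); // your Puppeteer or Cheerio logic
        cachedAt = Date.now();
        res.json({ firstPlayerName: cached });
      } catch (err) {
        console.error("Error during scraping:", err);
        res.status(500).json({ error: "Internal Server Error" });
      }
    });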

  2. There are several changes you might want to try:

    1. You’re using both await page.goto and await page.waitForNavigation, which may lead to a conflict. You might want to remove the await page.waitForNavigation line:

       await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });
      
    2. If the page takes longer to load due to a slow network or other factors, you may want to increase the timeout for the element lookup. Note that $eval itself does not accept a timeout option, so pass the timeout to page.waitForSelector instead:

       await page.waitForSelector('tbody tr:first-child .text-left a', { timeout: 60000 });
      
    3. Before using $eval, you may want to ensure that the element you are trying to select actually exists on the page. Use page.waitForSelector to wait for the element to appear.

       await page.waitForSelector('tbody tr:first-child .text-left a');
       const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());
      
    4. If the website is slow or experiencing network issues, increasing the timeout might not be enough. You can try adding a short wait after navigation to give late-loading content time to render:

      await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });
      await page.waitForTimeout(5000);
      
    5. Additionally, you might want to check the console output in the Puppeteer browser to see if there are any errors or messages that could provide more insight into the problem. You can disable headless mode ({ headless: false }) when launching Puppeteer to visually inspect what is happening on the page during the scraping process.
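
       For example, a minimal debugging setup along those lines might look like this (the console listener simply forwards the page's own log messages to your terminal):

       const browser = await puppeteer.launch({ headless: false });
       const [page] = await browser.pages();
       // Forward the page's console messages to the Node terminal
       page.on('console', msg => console.log('PAGE LOG:', msg.text()));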
