I am trying to scrape the very first name in a table on a website that presents a basketball team and its players' names and statistics. When I do, the navigation timeout is exceeded, meaning the value was not scraped in the given time, and on my client side "Error loading data" appears. What am I doing wrong?
FYI – There are various debugging statements used that are not essential to the functioning of the code.
Here is my JavaScript code:
const puppeteer = require('puppeteer');
const express = require('express');

const app = express();
app.use(express.static("public"));

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    console.log('Attempting to scrape data...');
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    // Increase the timeout to 60 seconds
    await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });
    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });
    const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());
    console.log('Scraping successful:', firstPlayerName);
    res.json({ firstPlayerName });
  } catch (err) {
    console.error('Error during scraping:', err);
    res.status(500).json({ error: 'Internal Server Error' });
  } finally {
    await browser?.close();
  }
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});
Here is my HTML code:
<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <table>
    <p class="robo-header">Robo-Scout</p>
    <p class="robo-subheader"><br> Official Algorithmic Basketball Scout</p>
    <tr>
      <td>
        <p id="myObjValue"> Loading... </p>
        <script>
          fetch('/scrape') // Send a GET request to the server
            .then(response => {
              if (!response.ok) {
                throw new Error('Network response was not ok');
              }
              return response.json();
            })
            .then(data => {
              console.log(data); // Check what data is received
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = data.firstPlayerName || 'Player name not found';
            })
            .catch(error => {
              console.error(error);
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = 'Error loading data'; // Display an error message
            });
        </script>
      </td>
    </tr>
  </table>
</body>
</html>
Here is the code from the cell of the table I’m trying to scrape:
<td class="text-left">
  <a href="/player/maddie-bulbulia/girlsbasketball/season/2022-2023">Maddie Bulbulia</a> <small class="text-muted">Sophomore • G</small>
</td>
I have tried debugging the code to trace why the value isn't being pulled, by logging when the value is missing along with the error. I have also tried increasing the navigation timeout from 30 to 60 seconds in case my network was slow; no change.
2 Answers
This code looks problematic: page.goto() already waits for navigation, so waiting for yet another navigation with page.waitForNavigation() causes a timeout. This is such a common mistake, I have a section about it in a blog post on typical Puppeteer mistakes. The solution is to remove the unnecessary page.waitForNavigation() line.

Secondly, use page.goto(url, { waitUntil: "domcontentloaded" }) rather than the default "load" event. Some anti-scraping approaches (or poorly-coded pages) seem to defer the load event, causing navigation timeouts. "domcontentloaded" is the fastest approach and almost always preferred.

Going a step further, since the data is baked into the static HTML, you can block all resource requests and disable JavaScript to speed the scrape up considerably.
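As a sketch of what that optimized setup can look like (this is a reconstruction, not a definitive script; it assumes Puppeteer's standard request-interception API, with the URL and selector reused from the question):

```javascript
// Configure a Puppeteer page for scraping static HTML quickly: disable JS and
// block every request except the main document. Sketch only; assumes
// Puppeteer's standard request-interception API.
async function optimizeForStaticHtml(page) {
  await page.setJavaScriptEnabled(false); // the data is baked into the static HTML
  await page.setRequestInterception(true);
  page.on('request', req => {
    // Allow only the main document; abort images, CSS, fonts, scripts, etc.
    if (req.resourceType() === 'document') {
      req.continue();
    } else {
      req.abort();
    }
  });
}

// Usage sketch (URL and selector from the question):
// const browser = await puppeteer.launch();
// const [page] = await browser.pages();
// await optimizeForStaticHtml(page);
// await page.goto(url, { waitUntil: 'domcontentloaded' });
// const name = await page.$eval('tbody tr:first-child .text-left a',
//                               el => el.textContent.trim());
```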
Going yet another step further, you may not even need Puppeteer. You can make a request with fetch, which is native in Node 18+, and parse the data you want from the response with a lightweight library like Cheerio.

Here are some quick benchmarks:

- Unoptimized Puppeteer (only using "domcontentloaded"):
- Optimized Puppeteer (using DCL, plus disabling JS and blocking resources):
- Fetch/Cheerio:
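A minimal sketch of the Puppeteer-free approach, using Node 18+'s built-in fetch. A small regex stands in for Cheerio here to keep the sketch dependency-free; in real code, Cheerio's CSS selectors are much more robust than regexes over HTML. The cell structure assumed is the one shown in the question:

```javascript
// Extract the first player's name from the raw HTML of the stats page.
// Assumes the cell structure shown in the question:
//   <td class="text-left"><a href="...">Name</a> ...</td>
function extractFirstPlayerName(html) {
  const match = html.match(/<td class="text-left">\s*<a [^>]*>([^<]+)<\/a>/);
  return match ? match[1].trim() : null;
}

// Fetch the page and parse out the first player's name.
async function scrapeFirstPlayerName(url) {
  const response = await fetch(url); // fetch is native in Node 18+
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return extractFirstPlayerName(await response.text());
}

module.exports = { extractFirstPlayerName, scrapeFirstPlayerName };
```

With Cheerio installed, extractFirstPlayerName would instead load the HTML and select 'tbody tr:first-child .text-left a', mirroring the Puppeteer selector.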
If the scraped data doesn't change often, you might consider caching the results of the scrape periodically so you can serve them to your users instantly and more reliably.
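A minimal sketch of that caching idea: an in-memory cache with a time-to-live (TTL), so repeated requests to /scrape are served instantly instead of re-launching the scrape each time. The wrapped scrape function and the TTL value are illustrative assumptions.

```javascript
// Wrap a scrape function so its result is cached for ttlMs milliseconds.
// Illustrative sketch: the scrape function and TTL are assumptions, not
// part of the original code.
function cachedScraper(scrapeFn, ttlMs) {
  let cached = null;  // last scraped value
  let fetchedAt = 0;  // timestamp of the last successful scrape
  return async () => {
    const now = Date.now();
    if (cached !== null && now - fetchedAt < ttlMs) {
      return cached; // still fresh: serve from cache
    }
    cached = await scrapeFn(); // stale or empty: scrape again
    fetchedAt = now;
    return cached;
  };
}

// Usage sketch inside the Express route (10-minute TTL; doScrape is
// whatever function performs the actual scrape):
// const getFirstPlayer = cachedScraper(doScrape, 10 * 60 * 1000);
// app.get('/scrape', async (req, res) => {
//   res.json({ firstPlayerName: await getFirstPlayer() });
// });
```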
There are several changes you might want to check:

You're using both await page.goto and await page.waitForNavigation, which can conflict. You might want to remove the await page.waitForNavigation line.
If the page takes longer to load due to a slow network or other factors, you may need a longer timeout when waiting for the element; note that $eval itself does not accept a timeout option, so the wait has to happen before it.
Before using $eval, you may want to ensure that the element you are trying to select actually exists on the page. Use page.waitForSelector to wait for the element to appear.
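A minimal sketch of that wait-then-read pattern, reusing the selector from the question (the 60-second timeout is illustrative):

```javascript
// Wait for the cell to exist before reading it, instead of assuming it is
// already in the DOM. The selector comes from the question; the 60 s timeout
// is an illustrative value.
async function getFirstPlayerName(page) {
  const selector = 'tbody tr:first-child .text-left a';
  await page.waitForSelector(selector, { timeout: 60000 });
  return page.$eval(selector, el => el.textContent.trim());
}

module.exports = { getFirstPlayerName };
```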
If the website is slow or experiencing network issues, increasing the timeout might not be enough. Rather than adding an arbitrary delay, wait for a concrete condition (such as a selector appearing) to ensure the page content you need has actually loaded.
Additionally, you might want to check the console output in the Puppeteer browser to see if there are any errors or messages that could provide more insight into the problem. You can disable headless mode ({ headless: false }) when launching Puppeteer to visually inspect what is happening on the page during the scraping process.