I’m trying to build a multi-search tool that searches a list of websites and outputs results from all of them on one page as an array.
The NPM package node-fetch works fine for most of the sites: fetch the desired URL, send the HTML to the front end and then use a DOM Parser to search through the HTML for the elements I want to extract data from.
Now, the last site I’m trying to search has un-rendered content, which appears in the fetched HTML as {{Title}}, for example, instead of the actual title text that I want to extract.
I realise the content is unrendered so after some searching I found Puppeteer and I’m trying to use it as below:
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto(url)
const HTML = await page.content()
await browser.close()
However, the output HTML variable still contains the un-rendered content.
I saw another StackOverflow post that recommended changing line 3 to this: await page.goto(url, { waitUntil: 'networkidle' })
However, this just throws Error: Unknown value for options.waitUntil: networkidle.
Any help appreciated.
2 Answers
networkidle is not an option. It’s either networkidle0 or networkidle2, depending on how many in-flight requests you want to allow before resolving.
But networkidle is not a good way to wait for particular data to load. It’s very pessimistic, waiting for a bunch of resources that you don’t care about. Networkidle can time out if too many requests stay unresolved for too long. It’s mainly suitable for screenshots, not scraping text.
Generally, wait for a response to arrive or a DOM node to exist. "domcontentloaded" is the fastest and most reliable load condition Puppeteer currently supports, and is generally the one to use.
Responses are often easier and more reliable to scrape than the DOM. The site you’re automating sends a request to /api/search and receives a JSON response containing the data you want. You can intercept this response with waitForResponse and grab the JSON payload without touching the document. One advantage here is that you get all of the data you might ever want, including prices in raw, unformatted form.
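A minimal sketch of the response-interception idea, assuming page is an already-open Puppeteer page and url is the search page you are loading. The helper names isSearchResponse and fetchSearchData are hypothetical, and the shape of the returned JSON depends entirely on the site:

```javascript
// Predicate for page.waitForResponse: match only successful responses
// from the /api/search endpoint mentioned above.
const isSearchResponse = (response) =>
  response.url().includes("/api/search") && response.ok();

// Hypothetical helper: navigate and capture the search API's JSON payload.
async function fetchSearchData(page, url) {
  // Register the wait and start the navigation together, so the response
  // cannot arrive before we begin listening for it.
  const [response] = await Promise.all([
    page.waitForResponse(isSearchResponse),
    page.goto(url, { waitUntil: "domcontentloaded" }),
  ]);
  // Parse the body as JSON; no DOM parsing is involved.
  return response.json();
}

// Usage (assumes puppeteer is installed):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const data = await fetchSearchData(page, url);
//   await browser.close();
```

Taking page as a parameter keeps the helper independent of how the browser is launched and closed, so the caller can wrap it in try/finally to make sure browser.close() always runs.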
But using the document isn’t too bad here either, assuming the selectors don’t change.
The quirk to handle is that the site loads with a template similar to <div class="price-details">{{Price}}</div>, so the standard waitForSelector(".price-details") would resolve prematurely. waitForFunction can be used instead to block until there’s more than just the template on the page. Later, the template is filtered out of the results.
Disclosure: I’m the author of the linked blog post.
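A minimal sketch of the DOM-based approach described in this answer, assuming page is an already-open Puppeteer page and that the .price-details selector from the template example is the one you want. The helper names isTemplate and scrapePrices are hypothetical, and the placeholder regex is an assumption about how the site's templates look:

```javascript
// Matches un-rendered placeholders such as {{Price}} or {{Title}}.
const isTemplate = (text) => /\{\{.*?\}\}/.test(text);

// Hypothetical helper: wait for real content, then collect it.
async function scrapePrices(page, url) {
  await page.goto(url, { waitUntil: "domcontentloaded" });

  // Block until at least one .price-details node holds non-empty text that
  // is not a placeholder. This callback runs inside the browser, so it
  // cannot see isTemplate and repeats the regex inline.
  await page.waitForFunction(() =>
    [...document.querySelectorAll(".price-details")].some(
      (el) => el.textContent.trim() && !/\{\{.*?\}\}/.test(el.textContent)
    )
  );

  // Read the text of every matching node.
  const texts = await page.$$eval(".price-details", (els) =>
    els.map((el) => el.textContent.trim())
  );

  // Drop empty strings and any template node still left in the document.
  return texts.filter((text) => text && !isTemplate(text));
}
```

The final filter matters because the template element may remain in the document alongside the rendered results, so waiting alone does not guarantee that every match is real data.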
To get rendered HTML content using Puppeteer, you need to wait until the page has fully loaded and rendered its dynamic content. The waitUntil option accepts specific values, such as 'networkidle2' or 'networkidle0'.
In this script: