JavaScript - Get rendered HTML from an external page with Node.js

ld98
June 5, 2024
189 views
0 votes
2 Answers

I’m trying to build a multi-search tool that searches a list of websites and outputs results from all of them on one page as an array.

The NPM package node-fetch works fine for most of the sites: fetch the desired URL, send the HTML to the front end and then use a DOM Parser to search through the HTML for the elements I want to extract data from.

Now, the last site I’m trying to search has un-rendered content which appears in the fetched HTML as {{Title}} for example instead of the actual text title which I want to extract.

I realise the content is unrendered so after some searching I found Puppeteer and I’m trying to use it as below:

```
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto(url)
const HTML = await page.content()
await browser.close()
```

However, the output HTML variable still contains the un-rendered content.

I saw another Stack Overflow post that recommended changing line 3 to await page.goto(url, { waitUntil: 'networkidle' }), but this just throws Error: Unknown value for options.waitUntil: networkidle.

Any help appreciated.

Answers

- ggorlen
- June 5, 2024 at 8:10 pm
- 0 votes
networkidle is not an option; it’s either networkidle0 or networkidle2, depending on how many in-flight requests you want to allow before resolving.

But networkidle is not a good way to wait for particular data to load. It’s very pessimistic, waiting for a bunch of resources that you don’t care about. Networkidle can time out if too many requests stay unresolved for too long. It’s mainly suitable for screenshots, not scraping text.

Generally, wait for a response to arrive or a DOM node to exist. "domcontentloaded" is the fastest and most reliable load condition Puppeteer currently supports, and it's generally the one to use.
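As a rough sketch of the "wait for a DOM node" approach applied to the original goal of getting rendered HTML: the helper below, getRenderedHtml (a hypothetical name, not part of Puppeteer), takes an already-open Puppeteer page, navigates with "domcontentloaded", waits for a selector and then returns page.content(). The selector is an assumption; you'd pick one that only matches once the data you care about has rendered.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page, and `selector` is a
// placeholder for an element that only exists once rendering is done.
async function getRenderedHtml(page, url, selector) {
  // Resolve as soon as the initial document is parsed...
  await page.goto(url, {waitUntil: "domcontentloaded"});
  // ...then block until the element we actually care about shows up.
  await page.waitForSelector(selector);
  // Serialize the DOM as it stands now, after client-side rendering.
  return page.content();
}
```

Usage would look something like const html = await getRenderedHtml(page, url, ".some-result"), where ".some-result" is made up. If the selector also matches markup that is already present in the initial HTML, the wait resolves too early, so the choice of selector matters.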

Responses are often easier and more reliable to scrape than the DOM. The site you’re automating sends a request to /api/search and receives a JSON response containing the data you want. You can intercept this response with waitForResponse and grab the JSON payload without touching the document:
```
const puppeteer = require("puppeteer"); // ^22.7.1

let browser;
(async () => {
  const url = "<Your URL>";
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const res = await page.waitForResponse(res =>
    res.url().endsWith("/api/search")
  );
  const {data: {dataset}} = await res.json();
  console.log(dataset.map(e => ({price: e.price, vrn: e.vrn})));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
```
One advantage here is you get all of the data you might ever want, including prices in raw, unformatted form.
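One thing to watch with this pattern: if the request finishes before waitForResponse is registered, the wait will time out. A possible guard, sketched below, is to start listening before navigating. gotoAndCaptureJson and pickFields are hypothetical names, and the {data: {dataset}} shape is simply the one used in the example above.

```javascript
// Hypothetical helper: start listening for the response *before* navigating,
// so a fast response can't slip past between goto() and waitForResponse().
// `page` is a Puppeteer Page; `matches` is a predicate over responses.
async function gotoAndCaptureJson(page, url, matches) {
  const [response] = await Promise.all([
    page.waitForResponse(matches),
    page.goto(url, {waitUntil: "domcontentloaded"}),
  ]);
  return response.json();
}

// Picks the two fields logged in the example above from the parsed payload.
const pickFields = payload =>
  payload.data.dataset.map(e => ({price: e.price, vrn: e.vrn}));
```

With a real page, this might be called as pickFields(await gotoAndCaptureJson(page, url, res => res.url().endsWith("/api/search"))).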

But using the document isn’t too bad here either, assuming the selectors don’t change:
```
const puppeteer = require("puppeteer");

let browser;
(async () => {
  const url = "<Your URL>";
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForFunction(`
    document.querySelectorAll(".price-details").length > 1
  `);
  const data = await page.$$eval(".featured-item", els => els.map(e => ({
    plate: e.querySelector(".btn-plate").textContent.trim(),
    price: e.querySelector(".price-details").textContent.trim(),
  })));
  console.log(data.filter(e => !e.plate.includes("{{")));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
```
The quirk handled here is that the site loads with a template similar to <div class="price-details">{{Price}}</div>, so the standard waitForSelector(".price-details") would resolve prematurely. waitForFunction is used instead to block until there’s more than just the template on the page. Later, the template is filtered out of the results.
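The filtering step can be pulled out into a small pure function, which makes it easy to check without a browser. This is only a sketch: the names are hypothetical, the sample rows are made up, and unlike the filter above (which checks plate only) it rejects a row if any string field still holds a {{...}} placeholder.

```javascript
// Matches mustache-style placeholders such as {{Price}} or {{ Title }}.
const PLACEHOLDER = /\{\{[^{}]*\}\}/;

// True when the text no longer contains an unrendered placeholder.
const isRendered = text => !PLACEHOLDER.test(text);

// Drops rows where any string field is still an unrendered template.
const dropTemplates = rows =>
  rows.filter(row =>
    Object.values(row).every(v => typeof v !== "string" || isRendered(v))
  );

// Made-up sample data: the first row is the leftover template.
const sampleRows = [
  {plate: "{{Plate}}", price: "{{Price}}"},
  {plate: "AB12 CDE", price: "£250"},
];
console.log(dropTemplates(sampleRows)); // keeps only the second row
```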

Disclosure: I’m the author of the linked blog post.

- TariqulIslam
- June 5, 2024 at 9:08 pm
- 0 votes
To get rendered HTML content using Puppeteer, you need to make sure that you are waiting for the page to fully load and render the dynamic content. The waitUntil option has specific values you can use, such as 'networkidle2' or 'networkidle0':
```
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to the URL and wait until the network is idle
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Extract the rendered HTML content
  const HTML = await page.content();

  await browser.close();
  console.log(HTML);
})();
```
In this script:
1. Puppeteer is launched and a new page is created.
2. page.goto(url, { waitUntil: 'networkidle2' }) is used to navigate to the URL and wait until the network is idle (no more than 2 network connections for at least 500 ms).
3. The rendered HTML content is retrieved with page.content().
4. The browser is closed.