I have converted my (scraped) JSON response into a string for use in HTML. This program simply seeks to pull the title of a book off of Amazon, remove the JSON formatting, and output that title in regular string format within the body of my HTML.

Is there a proper way to implement one of the (replace) or (replaceAll) fragments I provided at the bottom of this post into my code, or is there a different way to get this task done? Here is my code, and I preemptively thank you all for the help.

JS Code (scrapers.js):

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();

app.get('/scrape', async (req, res) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X');

  const [el2] = await page.$x('//*[@id="productTitle"]');
  const txt = await el2.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  
  const myObj = {rawTxt};

  res.json(myObj); // Send the JSON response

  browser.close();
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
}); 

HTML Code (index.html):


<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <p id="myObjValue"></p>
  <script>
    fetch('/scrape') // Send a GET request to the server
      .then(response => response.json())
      .then(data => {
        const myObjValueElement = document.getElementById('myObjValue');
        myObjValueElement.textContent = `rawTxt: ${data.rawTxt}`;
      })
      .catch(error => console.error(error));
  </script>
</body>
</html>

I have looked online, and the main solutions I have found are applying either of these to my string-converted JSON message:

.replaceAll("\{","")
.replaceAll("\}","")
.replaceAll("\[","")
.replaceAll("\]","")
.replaceAll(":","")
.replaceAll(",","")
.replaceAll(" ","");
.replace(/"/g, "")       // Remove double quotes
.replace(/{/g, "")       // Remove opening curly braces
.replace(/}/g, "")       // Remove closing curly braces
.replace(/\[/g, "")      // Remove opening square brackets
.replace(/\]/g, "")      // Remove closing square brackets
.replace(/:/g, "")        // Remove colons
.replace(/,/g, "")        // Remove commas
.replace(/ /g, "");       // Remove spaces

Unfortunately, I have not been able to implement either of these solutions correctly, and each time the JSON-formatted string {"rawTxt":" The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto) "} is output at http://localhost:3000 in the browser.

I would like this to be output instead – The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto).

2 Answers


  1. It appears that you’re starting the Express server, then navigating directly to the API route at http://localhost:3000/scrape rather than viewing the HTML page, which uses fetch to hit the API route. That missing step means you’ll see the raw JSON output from the API without the processing that the script in the HTML file does (i.e. response.json(), which parses the JSON into a plain JS object).
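
    For illustration, here's a minimal sketch of that distinction (the string below is shortened, but the rawTxt property matches your code):

    // What GET /scrape sends over the wire: a JSON-formatted string
    const body = '{"rawTxt":"The Black Swan: Second Edition ..."}';

    // What response.json() hands your page script: a plain JS object
    const data = JSON.parse(body); // the same parsing that response.json() performs
    console.log(data.rawTxt); // "The Black Swan: Second Edition ..." (no braces or quotes)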

    To serve the HTML page on the same origin as the server, you can adjust your server code as follows:

    const puppeteer = require('puppeteer');
    const express = require('express');
    const app = express();
    app.use(express.static('public')); // <-- added
    
    app.get('/scrape', async (req, res) => {
    // ...
    

    Then create a folder in the project root called public and move index.html into it.
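
    The resulting layout would look something like this (assuming your server file is named scrapers.js, as in the question):

    project-root/
    ├── scrapers.js
    └── public/
        └── index.html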

    Finally, restart the server and navigate to http://localhost:3000 (which serves index.html) and you should see the expected output.

    In short, your code is correct, but you might have misunderstood how to serve and view the HTML page.


    As an aside, you can simplify your Puppeteer selector and improve error handling to ensure you always close the browser:

    app.get('/scrape', async (req, res) => {
      let browser;
      try {
        browser = await puppeteer.launch();
        const [page] = await browser.pages();
        await page.goto('<Your URL>'); // TODO: paste your URL here!
        const rawTxt = await page.$eval(
          '#productTitle',
          el => el.textContent.trim()
        );
        res.json({rawTxt});
      }
      catch (err) {
        console.error(err);
      }
      finally {
        await browser?.close();
      }
    });
    

    Better yet, since the particular piece of data you want is baked into the static HTML, you can speed things up by disabling JS, blocking all requests except for the base HTML page and using domcontentloaded:

    const puppeteer = require("puppeteer"); // ^21.2.1
    const express = require("express"); // ^4.18.2
    const app = express();
    app.use(express.static("public"));
    
    const url = "<Your URL>";
    
    app.get("/scrape", async (req, res) => {
      let browser;
      try {
        browser = await puppeteer.launch({headless: "new"});
        const [page] = await browser.pages();
        await page.setJavaScriptEnabled(false);
        await page.setRequestInterception(true);
        page.on("request", req =>
          req.url() === url ? req.continue() : req.abort()
        );
        await page.goto(url, {waitUntil: "domcontentloaded"});
        const rawTxt = await page.$eval("#productTitle", el =>
          el.textContent.trim()
        );
        res.json({rawTxt});
      }
      catch (err) {
        console.error(err);
      }
      finally {
        await browser?.close();
      }
    });
    
    app.listen(3000, () => {
      console.log("Server is running on http://localhost:3000");
    });
    

    You can also share one browser instance across all requests. See this example if you’re interested.
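
    A minimal sketch of that approach (the route body and launch options here are my assumptions, not taken from the linked example):

    const puppeteer = require("puppeteer");
    const express = require("express");
    const app = express();
    app.use(express.static("public"));

    const url = "<Your URL>";

    // Launch once; every request awaits the same shared browser
    const browserPromise = puppeteer.launch({headless: "new"});

    app.get("/scrape", async (req, res) => {
      let page;
      try {
        const browser = await browserPromise;
        page = await browser.newPage();
        await page.goto(url, {waitUntil: "domcontentloaded"});
        const rawTxt = await page.$eval("#productTitle", el =>
          el.textContent.trim()
        );
        res.json({rawTxt});
      }
      catch (err) {
        console.error(err);
        res.sendStatus(500);
      }
      finally {
        await page?.close(); // close the tab, but keep the shared browser alive
      }
    });

    app.listen(3000, () => {
      console.log("Server is running on http://localhost:3000");
    });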

  2. Here are some ways we could optimize this web scraping code:

    1. Cache the browser instance – Open the browser once and reuse it for
      multiple requests instead of launching a new browser on every
      request.

    2. Use async/await instead of promises – Makes the code a bit cleaner
      by avoiding promise chains.

    3. Extract constants – Move strings like the URL into constants to
      avoid duplicates.

    4. Use a template literal for setting text – Avoid concatenation when
      setting the element text.

    5. Error handling – Add some error handling in the scrape route in case
      the page errors.

    6. Use headless mode – Launch Chromium in headless mode for faster
      processing without a UI.

    7. Parallelize requests – Handle multiple requests concurrently instead
      of sequentially.

    8. Persistent storage – Save scraped data in a cache or database to
      avoid duplicate scrapes (a small caching sketch follows the code below).

    9. Streaming response – Stream the JSON response instead of building
      the full object to improve memory usage.

      Here is one way to implement some of these optimizations:

      // scraper.js
      const express = require('express');
      const puppeteer = require('puppeteer');
      const { URL } = require('./constants'); // e.g. module.exports = { URL: 'https://...' };
      const { getTitle } = require('./page-utils');

      const app = express();

      // Cache the browser instance: launch once, then reuse it for every request
      let browserPromise;
      const getBrowser = () => {
        if (!browserPromise) {
          browserPromise = puppeteer.launch({ headless: 'new' });
        }
        return browserPromise;
      };

      app.get('/scrape', async (req, res) => {
        let page;
        try {
          const browser = await getBrowser(); // cached instance
          page = await browser.newPage();
          await page.goto(URL);

          const title = await getTitle(page);

          res.json({ title });
        } catch (e) {
          console.error(e);
          res.sendStatus(500);
        } finally {
          await page?.close(); // close the tab, keep the cached browser running
        }
      });

      app.listen(3000);

      // page-utils.js
      const getTitle = async (page) => {
        const [el] = await page.$x('//*[@id="productTitle"]');
        const txt = await el.getProperty('textContent');
        return txt.jsonValue();
      };

      module.exports = { getTitle };
      