I have converted my (scraped) JSON response into a string for use in HTML. This program simply seeks to pull the title of a book off of Amazon, remove the JSON formatting, and output that title in regular string format within the body of my HTML.

Is there a proper way to implement one of the (replace) or (replaceAll) fragments I provided at the bottom of this post into my code, or is there a different way to get this task done? Here is my code, and I preemptively thank you all for the help.

JS Code (scrapers.js):

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();

app.get('/scrape', async (req, res) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X');

  const [el2] = await page.$x('//*[@id="productTitle"]');
  const txt = await el2.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  
  const myObj = {rawTxt};

  res.json(myObj); // Send the JSON response

  browser.close();
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
}); 

HTML Code (index.html):


<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <p id="myObjValue"></p>
  <script>
    fetch('/scrape') // Send a GET request to the server
      .then(response => response.json())
      .then(data => {
        const myObjValueElement = document.getElementById('myObjValue');
        myObjValueElement.textContent = `rawTxt: ${data.rawTxt}`;
      })
      .catch(error => console.error(error));
  </script>
</body>
</html>

I have looked online, and the main solutions I have found are applying either of these to my string-converted JSON message:

.replaceAll("\{","")
.replaceAll("\}","")
.replaceAll("\[","")
.replaceAll("\]","")
.replaceAll(":","")
.replaceAll(",","")
.replaceAll(" ","");
.replace(/"/g, "")       // Remove double quotes
.replace(/{/g, "")       // Remove opening curly braces
.replace(/}/g, "")       // Remove closing curly braces
.replace(/\[/g, "")      // Remove opening square brackets
.replace(/\]/g, "")      // Remove closing square brackets
.replace(/:/g, "")        // Remove colons
.replace(/,/g, "")        // Remove commas
.replace(/ /g, "");       // Remove spaces

Unfortunately, I have not been able to implement either of these solutions correctly, and each time the JSON-formatted string {"rawTxt":" The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto) "} is output at http://localhost:3000 in the browser.

I would like this to be output instead – The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto).

2 Answers


  1. It appears that you’re starting the Express server, then navigating directly to the API route at http://localhost:3000/scrape rather than viewing the HTML page, which uses fetch to hit the API route. That missing step means you’ll see the raw JSON output from the API without the processing that the script in the HTML file does (i.e. response.json(), which parses the JSON into a plain JS object).
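
    For illustration, here's a minimal sketch of that distinction (the string below is shortened, but the rawTxt property matches your code):

    // What GET /scrape sends over the wire: a JSON-formatted string
    const body = '{"rawTxt":"The Black Swan: Second Edition ..."}';

    // What response.json() hands your page script: a plain JS object
    const data = JSON.parse(body); // the same parsing that response.json() performs
    console.log(data.rawTxt); // "The Black Swan: Second Edition ..." (no braces or quotes)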

    To serve the HTML page on the same origin as the server, you can adjust your server code as follows:

    const puppeteer = require('puppeteer');
    const express = require('express');
    const app = express();
    app.use(express.static('public')); // <-- added
    
    app.get('/scrape', async (req, res) => {
    // ...
    

    Then create a folder in the project root called public and move index.html into it.
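
    The resulting layout would look something like this (assuming your server file is named scrapers.js, as in the question):

    project-root/
    ├── scrapers.js
    └── public/
        └── index.html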

    Finally, restart the server and navigate to http://localhost:3000 (which serves index.html) and you should see the expected output.

    In short, your code is correct, but you might have misunderstood how to serve and view the HTML page.


    As an aside, you can simplify your Puppeteer selector and improve error handling to ensure you always close the browser:

    app.get('/scrape', async (req, res) => {
      let browser;
      try {
        browser = await puppeteer.launch();
        const [page] = await browser.pages();
        await page.goto('<Your URL>'); // TODO: paste your URL here!
        const rawTxt = await page.$eval(
          '#productTitle',
          el => el.textContent.trim()
        );
        res.json({rawTxt});
      }
      catch (err) {
        console.error(err);
      }
      finally {
        await browser?.close();
      }
    });
    

    Better yet, since the particular piece of data you want is baked into the static HTML, you can speed things up by disabling JS, blocking all requests except for the base HTML page and using domcontentloaded:

    const puppeteer = require("puppeteer"); // ^21.2.1
    const express = require("express"); // ^4.18.2
    const app = express();
    app.use(express.static("public"));
    
    const url = "<Your URL>";
    
    app.get("/scrape", async (req, res) => {
      let browser;
      try {
        browser = await puppeteer.launch({headless: "new"});
        const [page] = await browser.pages();
        await page.setJavaScriptEnabled(false);
        await page.setRequestInterception(true);
        page.on("request", req =>
          req.url() === url ? req.continue() : req.abort()
        );
        await page.goto(url, {waitUntil: "domcontentloaded"});
        const rawTxt = await page.$eval("#productTitle", el =>
          el.textContent.trim()
        );
        res.json({rawTxt});
      }
      catch (err) {
        console.error(err);
      }
      finally {
        await browser?.close();
      }
    });
    
    app.listen(3000, () => {
      console.log("Server is running on http://localhost:3000");
    });
    

    You can also share one browser instance across all requests. See this example if you’re interested.
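
    A minimal sketch of that approach (the route body and launch options here are my assumptions, not taken from the linked example):

    const puppeteer = require("puppeteer");
    const express = require("express");
    const app = express();
    app.use(express.static("public"));

    const url = "<Your URL>";

    // Launch once; every request awaits the same shared browser
    const browserPromise = puppeteer.launch({headless: "new"});

    app.get("/scrape", async (req, res) => {
      let page;
      try {
        const browser = await browserPromise;
        page = await browser.newPage();
        await page.goto(url, {waitUntil: "domcontentloaded"});
        const rawTxt = await page.$eval("#productTitle", el =>
          el.textContent.trim()
        );
        res.json({rawTxt});
      }
      catch (err) {
        console.error(err);
        res.sendStatus(500);
      }
      finally {
        await page?.close(); // close the tab, but keep the shared browser alive
      }
    });

    app.listen(3000, () => {
      console.log("Server is running on http://localhost:3000");
    });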

  2. Here are some ways we could optimize this web scraping code:

    1. Cache the browser instance – Open the browser once and reuse it for
      multiple requests instead of launching a new browser on every
      request.

    2. Use async/await instead of promises – Makes the code a bit cleaner
      by avoiding promise chains.

    3. Extract constants – Move strings like the URL into constants to
      avoid duplicates.

    4. Use a template literal for setting text – Avoid concatenation when
      setting the element text.

    5. Error handling – Add some error handling in the scrape route in case
      the page errors.

    6. Use headless mode – Launch Chromium in headless mode for faster
      processing without a UI.

    7. Parallelize requests – Handle multiple requests concurrently instead
      of sequentially.

    8. Persistent storage – Save scraped data in a cache or database to
      avoid duplicate scrapes (a small caching sketch follows the code below).

    9. Streaming response – Stream the JSON response instead of building
      the full object to improve memory usage.

      Here is one way to implement some of these optimizations:

      // scraper.js
      const express = require('express');
      const puppeteer = require('puppeteer');
      const { URL } = require('./constants'); // e.g. module.exports = { URL: 'https://...' };
      const { getTitle } = require('./page-utils');

      const app = express();

      // Cache the browser instance: launch once, then reuse it for every request
      let browserPromise;
      const getBrowser = () => {
        if (!browserPromise) {
          browserPromise = puppeteer.launch({ headless: 'new' });
        }
        return browserPromise;
      };

      app.get('/scrape', async (req, res) => {
        let page;
        try {
          const browser = await getBrowser(); // cached instance
          page = await browser.newPage();
          await page.goto(URL);

          const title = await getTitle(page);

          res.json({ title });
        } catch (e) {
          console.error(e);
          res.sendStatus(500);
        } finally {
          await page?.close(); // close the tab, keep the cached browser running
        }
      });

      app.listen(3000);

      // page-utils.js
      const getTitle = async (page) => {
        const [el] = await page.$x('//*[@id="productTitle"]');
        const txt = await el.getProperty('textContent');
        return txt.jsonValue();
      };

      module.exports = { getTitle };
      