I have converted my (scraped) JSON response into a string for use in HTML. This program simply seeks to pull the title of a book off of Amazon, remove the JSON formatting, and output that title as a plain string within the body of my HTML.
Is there a proper way to implement one of the replace() or replaceAll() fragments I provided at the bottom of this post into my code, or is there a different way to get this task done? Here is my code, and I preemptively thank you all for the help.
JS Code (scrapers.js):
const puppeteer = require('puppeteer');
const express = require('express');

const app = express();

app.get('/scrape', async (req, res) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X');
  const [el2] = await page.$x('//*[@id="productTitle"]'); // locate the title node by XPath
  const txt = await el2.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  const myObj = {rawTxt};
  res.json(myObj); // Send the JSON response
  await browser.close();
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});
HTML Code (index.html):
<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <p id="myObjValue"></p>
    <script>
      fetch('/scrape') // Send a GET request to the server
        .then(response => response.json())
        .then(data => {
          const myObjValueElement = document.getElementById('myObjValue');
          myObjValueElement.textContent = `rawTxt: ${data.rawTxt}`;
        })
        .catch(error => console.error(error));
    </script>
  </body>
</html>
I have looked online, and the main solutions I have found are applying either of these to my string-converted JSON message:
.replaceAll("\{","")
.replaceAll("\}","")
.replaceAll("\[","")
.replaceAll("\]","")
.replaceAll(":","")
.replaceAll(",","")
.replaceAll(" ","");
.replace(/"/g, "") // Remove double quotes
.replace(/{/g, "") // Remove opening curly braces
.replace(/}/g, "") // Remove closing curly braces
.replace(/\[/g, "") // Remove opening square brackets
.replace(/\]/g, "") // Remove closing square brackets
.replace(/:/g, "") // Remove colons
.replace(/,/g, "") // Remove commas
.replace(/ /g, ""); // Remove spaces
Unfortunately, I have not been able to implement either of these solutions correctly, and each time the JSON-formatted string {"rawTxt":" The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto) "} is output at localhost:3000 in the browser.
I would like this to be output instead – The Black Swan: Second Edition: The Impact of the Highly Improbable: With a new section: "On Robustness and Fragility" (Incerto).
2 Answers
It appears that you're starting the Express server, then navigating directly to the API route at http://localhost:3000/scrape rather than viewing the HTML page, which uses fetch to hit the API route. Skipping that step means you'll see the raw JSON output from the API, without the processing that the script in the HTML file does (i.e. response.json(), which parses the JSON into a plain JS object). To serve the HTML page on the same origin as the server, you can adjust your server code as follows:
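For example, a minimal sketch, assuming Express's built-in express.static middleware and a folder named public (created in the next step):

const puppeteer = require('puppeteer');
const express = require('express');

const app = express();

// Serve static files (including index.html) from ./public on the same origin
app.use(express.static('public'));

// ... the /scrape route and app.listen(3000, ...) stay as they are ...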
Then create a folder in the project root called public and move index.html into it. Finally, restart the server and navigate to http://localhost:3000 (which serves index.html), and you should see the expected output.
In short, your code is correct, but you might have misunderstood how to serve and view the HTML page.
As an aside, you can simplify your Puppeteer selector and improve error handling to ensure you always close the browser:
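A sketch of that idea, assuming the CSS selector #productTitle is equivalent to your XPath (it targets the same id), with try/catch/finally so the browser is closed even when the scrape throws:

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages(); // reuse the tab launch() already opens
    await page.goto('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X');
    // $eval with a CSS id selector replaces the XPath/getProperty/jsonValue dance
    const rawTxt = await page.$eval('#productTitle', el => el.textContent);
    res.json({rawTxt});
  } catch (err) {
    console.error(err);
    res.sendStatus(500);
  } finally {
    await browser?.close(); // always release the browser
  }
});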
Better yet, since the particular piece of data you want is baked into the static HTML, you can speed things up by disabling JS, blocking all requests except for the base HTML page, and using domcontentloaded rather than the default load event.
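A sketch of that approach, using Puppeteer's request-interception API; only the scraping portion is shown, and url stands for the Amazon product URL above:

await page.setJavaScriptEnabled(false); // the title is present without JS
await page.setRequestInterception(true);
page.on('request', request => {
  // Let the base HTML document through; abort images, scripts, styles, etc.
  if (request.resourceType() === 'document') {
    request.continue();
  } else {
    request.abort();
  }
});
await page.goto(url, {waitUntil: 'domcontentloaded'}); // don't wait for the full load event
const rawTxt = await page.$eval('#productTitle', el => el.textContent);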
You can also share one browser instance across all requests. See this example if you're interested.
Here are some ways we could optimize this web scraping code:
- Cache the browser instance – open the browser once and reuse it for multiple requests instead of launching a new browser on every request.
- Use async/await instead of promises – makes the code a bit cleaner by avoiding promise chains.
- Extract constants – move strings like the URL into constants to avoid duplicates.
- Use a template literal for setting text – avoid concatenation when setting the element text.
- Error handling – add some error handling in the scrape route in case the page errors.
- Use headless mode – launch Chromium in headless mode for faster processing without a UI.
- Parallelize requests – handle multiple requests concurrently instead of sequentially.
- Persistent storage – save scraped data in a cache or database to avoid duplicate scrapes.
- Streaming response – stream the JSON response instead of building the full object to improve memory usage.
Here is one way to implement some of these optimizations:
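The answer's original snippet isn't reproduced here; the following sketch implements a few of the points above (cached browser instance, extracted constants, error handling, and an in-memory Map standing in for persistent storage):

const puppeteer = require('puppeteer');
const express = require('express');

const app = express();
const PORT = 3000;
const PRODUCT_URL = 'https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X';

// Cache the browser instance: launch once, reuse for every request
let browserPromise;
const getBrowser = () => {
  if (!browserPromise) {
    browserPromise = puppeteer.launch();
  }
  return browserPromise;
};

// Simple in-memory cache so repeat requests skip the scrape entirely
const cache = new Map();

app.get('/scrape', async (req, res) => {
  try {
    if (cache.has(PRODUCT_URL)) {
      return res.json(cache.get(PRODUCT_URL));
    }
    const browser = await getBrowser();
    const page = await browser.newPage();
    try {
      await page.goto(PRODUCT_URL, {waitUntil: 'domcontentloaded'});
      const rawTxt = await page.$eval('#productTitle', el => el.textContent);
      const myObj = {rawTxt};
      cache.set(PRODUCT_URL, myObj);
      res.json(myObj);
    } finally {
      await page.close(); // close the tab, keep the shared browser alive
    }
  } catch (err) {
    console.error(err);
    res.sendStatus(500);
  }
});

app.listen(PORT, () => {
  console.log(`Server is running on http://localhost:${PORT}`);
});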