I am trying to scrape a website using puppeteer and cheerio. I have gotten the html of the page I want to scrape using puppeteer. I have loaded that html into cheerio.
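```javascript
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Helper (not shown) that navigates to the target page and returns its HTML
  const html = await get_to_page_with_required_source_code(page);
  const $ = cheerio.load(html);
  await browser.close();
}

run();
```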
What I want to do now is remove every element from the HTML that contains no text. Below is an example.
```html
<div class="abc">
  <img src="..." />
</div>
<div class="def">
  <div class="jkl">
    <span class="ghi">This is a text</span>
  </div>
  <div class="mno">This is another text</div>
</div>
```
The output of the above HTML should be:
```html
<span class="ghi">This is a text</span>
<div class="mno">This is another text</div>
```
since these are the only two elements that contain text in them.
How can I accomplish this?
2 Answers
For starters, generally don’t combine Puppeteer and Cheerio. Either the site is dynamic, in which case use Puppeteer and work directly with the live DOM (use jQuery if you like Cheerio syntax), or if the site is static, use fetch and Cheerio alone and skip the Puppeteer slowness.
Here’s one way to do it with Cheerio (you can toss in fetch to request the data if it’s a static site):
Here’s how to do it with Puppeteer:
Both approaches produce the same output, an array of the matching elements' HTML strings (you can add `.join("\n")` if you want your output to be a single string exactly as you posted it).

Keep in mind: this is a bit of an odd thing to want to do, so there might be a better way to achieve whatever you’re really trying to achieve.