I have a Node.js web scraping project where I need to scrape the headings and text from the main content of a page. The problem is that I can’t determine which part is the main content when there is no <aside> or <main> tag, and no class, id, or role named aside or main. I’m using the Puppeteer and Cheerio libraries. I have tried Mercury Web Parser, but it has its own problems: for example, it doesn’t return any content from pages built with the Elementor theme builder on WordPress. If anyone has an idea of how I can differentiate the main content from the rest of the web page, it would be really helpful.
2 Answers
You can check out the Readability.js library from Mozilla. It is what Firefox uses for its Reader View.
It is also worth learning more about CSS selectors and specificity.
If you’re scraping Elementor pages, here is a trick for the selectors: use the data-elementor-(attributename) attributes to target elements in the DOM.