
I have a Node.js web scraping project in which I need to scrape the headings and text from a page's main content. The problem is that I can't determine which part is the main content when there is no aside or main tag, and no class, id, or role named aside or main. I'm using Puppeteer and the Cheerio library. I have tried Mercury Web Parser, but it has its own problems: for example, it doesn't return any content from pages built with the Elementor theme builder on WordPress. If anyone has an idea of how I can differentiate the main content from the rest of the web page, it would be really helpful.

2 Answers


  1. You can check out the Readability.js library from Mozilla. They use it for Firefox's Reader View.

  2. Try to learn more about CSS selectors and specificity.
    If you're scraping Elementor pages, use this trick for the selector:
    use the data-elementor-(attributename) attributes to select elements in the DOM.

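    // Puppeteer's method is waitForSelector; timeout: 0 disables the timeout, so this waits indefinitely.
    const mainContent = await page.waitForSelector('[data-elementor-type="wp-page"]', {visible: true, timeout: 0})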
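
For the Readability suggestion in the first answer, here is a minimal sketch of how it could be combined with Puppeteer and Cheerio. It assumes the puppeteer, jsdom, @mozilla/readability, and cheerio packages are installed. The function names scrapeMainContent and groupByHeading are made up for this illustration, and the choice of waitUntil: 'networkidle2' and of which tags to collect are assumptions, not something from the question. Readability is a heuristic, so it may still miss content on some pages.

```javascript
// Assumed dependencies: npm install puppeteer jsdom @mozilla/readability cheerio
// They are required lazily inside scrapeMainContent so that the pure helper
// below can be used on its own.

// Hypothetical helper: group a flat list of { tag, text } blocks into
// { heading, text } sections, starting a new section at every heading.
function groupByHeading(blocks) {
  const sections = [];
  let current = { heading: null, text: [] };
  for (const { tag, text } of blocks) {
    if (/^h[1-6]$/.test(tag)) {
      // A new heading closes the previous section, if it had any content.
      if (current.heading !== null || current.text.length > 0) sections.push(current);
      current = { heading: text, text: [] };
    } else if (text) {
      current.text.push(text);
    }
  }
  if (current.heading !== null || current.text.length > 0) sections.push(current);
  return sections.map((s) => ({ heading: s.heading, text: s.text.join('\n') }));
}

// Hypothetical entry point: render the page, let Readability pick the main
// content, then pull headings and paragraphs out of it with Cheerio.
async function scrapeMainContent(url) {
  const puppeteer = require('puppeteer');
  const { JSDOM } = require('jsdom');
  const { Readability } = require('@mozilla/readability');
  const cheerio = require('cheerio');

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    const html = await page.content(); // HTML after client-side rendering

    // Readability works on a DOM document; passing the URL lets relative links resolve.
    const dom = new JSDOM(html, { url });
    const article = new Readability(dom.window.document).parse();
    if (!article) return []; // Readability found nothing article-like

    // article.content is the cleaned HTML of the detected main content.
    const $ = cheerio.load(article.content);
    const blocks = $('h1, h2, h3, h4, h5, h6, p')
      .map((_, el) => ({ tag: el.name, text: $(el).text().trim() }))
      .get();
    return groupByHeading(blocks);
  } finally {
    await browser.close();
  }
}
```

Calling scrapeMainContent('https://example.com/some-post') should resolve to an array of { heading, text } objects. Text that appears before the first heading ends up in a section whose heading is null.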
    