JavaScript - Puppeteer scraping dynamic content

chenzen
July 15, 2024

I’m trying to scrape data from a Looker Studio report page using Puppeteer in Node.js, but I’m running into issues because the report is rendered dynamically. When I fetch the page, the body is empty. Here’s my code:

import puppeteer from 'puppeteer';

async function fetchData() {
  try {
    const url = 'https://lookerstudio.google.com/u/0/reporting/e36054dd-ffc0-4ef4-b8ab-4d10f7ab4cda/page/wmP0D';
    const options = {
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--disable-gpu'
      ],
      headless: true
    };
    const browser = await puppeteer.launch(options);
    const page = await browser.newPage();

    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36');
    await page.setViewport({width: 1920, height: 1080});
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      if (req.resourceType() === 'stylesheet' || req.resourceType() === 'font' || req.resourceType() === 'image') {
        req.abort();
      } else {
        req.continue();
      }
    });

    await page.goto(url, {waitUntil: 'networkidle0'});

    await page.waitForSelector('.looker-report', { timeout: 60000 });

    const text = await page.evaluate(() => {
      return document.body.innerText;
    });

    console.log(text);  // This logs an empty string

    await page.close();
    await browser.close();
  } catch (error) {
    console.error('Error fetching data:', error);
  }
}

fetchData();

The issue I’m facing is that the text is always empty, even though I can see the data when I open the URL in a browser.
How can I modify my Puppeteer script to successfully scrape the dynamically loaded content from this Looker Studio report?

Any help or guidance would be greatly appreciated. Thank you!

I’ve tried:

Waiting for the ‘.looker-report’ selector
Using ‘networkidle0’ as the wait condition
Setting a longer timeout

However, none of these approaches have worked. The page seems to load its content dynamically, and I’m not sure how to capture this data.

What I want to do: if you open the link, the page has a table, and I am trying to fetch the first few rows of that table.

Answers

There are a few likely causes. The main one is a timeout that is too short, which can be addressed by increasing the timeout so the page has time to load all of its content. Secondly, rather than waiting only for the .looker-report element, identify the element that is populated once the data has been fetched and rendered, wait for it, and only then extract the data.
Finally, page.setRequestInterception(true) lets you intercept network requests, so you can abort the ones the scrape does not need (stylesheets, fonts, images) and let the rest continue. Note that interception does not wait for anything by itself; the waiting is done by waitForSelector and waitForFunction.

Refer to the code below for these modifications:

import puppeteer from 'puppeteer';

async function fetchData() {
  try {
    const url = 'https://lookerstudio.google.com/u/0/reporting/e36054dd-ffc0-4ef4-b8ab-4d10f7ab4cda/page/wmP0D';
    const options = {
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--disable-gpu'
      ],
      headless: true
    };
    const browser = await puppeteer.launch(options);
    const page = await browser.newPage();

    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36');
    await page.setViewport({ width: 1920, height: 1080 });

    // Intercept network requests and skip stylesheets, fonts, and images.
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      if (req.resourceType() === 'stylesheet' || req.resourceType() === 'font' || req.resourceType() === 'image') {
        req.abort();
      } else {
        req.continue();
      }
    });

    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });

    // Wait for a specific DOM element to appear.
    await page.waitForSelector('.looker-report', { timeout: 60000 });

    // Wait for a custom condition: here, until a .data-element node has non-empty text.
    // .data-element appears to be a placeholder selector; replace it with the element that holds the data.
    await page.waitForFunction(() => {
      const dataElement = document.querySelector('.data-element');
      return dataElement && dataElement.textContent.trim() !== '';
    }, { timeout: 60000 });

    const text = await page.evaluate(() => {
      return document.body.innerText;
    });
    console.log(text);
    await browser.close();
  } catch (error) {
    console.error('Error fetching data:', error);
  }
}

fetchData();

Hope it helps 🙂
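
The wait-for-data step in the code above can also be factored into a small helper that takes the selector as a parameter, which makes it easier to try different selectors while inspecting the report. This is only a minimal sketch: it assumes a Puppeteer page object is passed in, and waitForPopulated is a hypothetical name, not part of Puppeteer.

```javascript
// Hypothetical helper: resolves once the first element matching `selector`
// exists and has non-empty text, or rejects when Puppeteer's timeout expires.
async function waitForPopulated(page, selector, timeout = 60000) {
  // Arguments after the options object are passed to the in-page function.
  await page.waitForFunction(
    (sel) => {
      const el = document.querySelector(sel);
      return Boolean(el && el.textContent.trim() !== '');
    },
    { timeout },
    selector
  );
}

// Usage inside fetchData(), after page.goto(...):
//   await waitForPopulated(page, '.data-element');
```

Passing the selector as an argument matters because the function given to waitForFunction runs in the browser, where variables from the Node.js script are not in scope.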

I had a look at the document returned and don’t see a .looker-report element. I dug around and it looks like the table has the class table, so I’m waiting for .table instead.

import puppeteer from 'puppeteer';

async function fetchData() {
  try {
    const url = 'https://lookerstudio.google.com/u/0/reporting/e36054dd-ffc0-4ef4-b8ab-4d10f7ab4cda/page/wmP0D';
    const options = {
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--disable-gpu'
      ],
      // Show the browser window.
      headless: false
    };
    const browser = await puppeteer.launch(options);
    const page = await browser.newPage();

    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36');
    await page.setViewport({width: 1920, height: 1080});
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      if (req.resourceType() === 'stylesheet' || req.resourceType() === 'font' || req.resourceType() === 'image') {
        req.abort();
      } else {
        req.continue();
      }
    });

    // Use networkidle2 rather than networkidle0.
    await page.goto(url, {waitUntil: 'networkidle2'});

    await page.waitForSelector('.table', { timeout: 60000 });

    const text = await page.evaluate(() => {
      return document.body.innerText;
    });

    console.log(text);

    await page.close();
    await browser.close();
  } catch (error) {
    console.error('Error fetching data:', error);
  }
}

fetchData();

Output:

DLMM Max Fees Opportunities
Reset
Share
arrow_drop_down
DLMM Max Fees Opportunities
Pair Name
DEX
Meteora
Bin Step
TVL
FDV
24hr Changes
Max 1d Fees
Max 1d Fees / TVL
▼
Dog-SOL
400
5.3K
794.4K
-0.46%
$38,745
736.31%
FIST-SOL
80
2.2K
140.7K
-0.92%
$9,527
425.47%
BOB-SOL
80
4.8K
1.6M
4.66%
$20,169
417.32%
EAR-SOL
200
4.5K
3.7M
-0.81%
$3,732
82.10%
KENZO-SOL
100
2.4K
336.1K
0.75%
$1,839
78.08%
EAR-SOL
400
107.5K
3.7M
-0.82%
$76,579
71.26%
SARB-BOB
400
4.5K
399.1K
0.06%
$3,118
68.69%
SIGMA-SOL
250
1.7K
1.9M
0.10%
$977
56.83%
EAR-SOL
80
6.2K
3.9M
-0.82%
$2,958
47.66%
MOB-SOL
100
50.9K
5.2M
0.82%
$21,076
41.42%
1 - 50 / 97
<
>
Data Last Updated: 15/07/2024 06:08:33
Privacy Policy

This doesn’t seem to return all of the expected content, but at least it returns something!
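
Since the goal was the first few rows of the table, the flat text in the output above can be regrouped into rows. The following is a rough sketch that relies on assumptions drawn only from that output: the data starts right after the ▼ sort marker, each row has seven fields in the column order shown, and the data ends at the pagination line (for example 1 - 50 / 97). The names parseTableRows and COLUMNS are made up for illustration, and the parsing will break if the report layout changes.

```javascript
// Column order as it appears in the logged output (assumed to stay stable).
const COLUMNS = ['pairName', 'binStep', 'tvl', 'fdv', 'change24h', 'maxFees1d', 'maxFees1dPerTvl'];

// Hypothetical helper: turn the flat innerText into an array of row objects.
function parseTableRows(text, limit = Infinity) {
  const lines = text.split('\n').map((line) => line.trim()).filter(Boolean);

  // Assumption: the sort marker is the last header line before the data.
  const marker = lines.indexOf('▼');
  if (marker === -1) return [];
  const start = marker + 1;

  // Assumption: a pagination line such as "1 - 50 / 97" follows the data.
  let end = lines.findIndex((line, i) => i >= start && /^\d+ - \d+ \/ \d+$/.test(line));
  if (end === -1) end = lines.length;

  const rows = [];
  for (let i = start; i + COLUMNS.length <= end && rows.length < limit; i += COLUMNS.length) {
    const row = {};
    COLUMNS.forEach((name, j) => { row[name] = lines[i + j]; });
    rows.push(row);
  }
  return rows;
}

// Usage with a shortened copy of the output above.
const sample = [
  'Pair Name', 'DEX', 'Meteora', 'Bin Step', 'TVL', 'FDV',
  '24hr Changes', 'Max 1d Fees', 'Max 1d Fees / TVL', '▼',
  'Dog-SOL', '400', '5.3K', '794.4K', '-0.46%', '$38,745', '736.31%',
  'FIST-SOL', '80', '2.2K', '140.7K', '-0.92%', '$9,527', '425.47%',
  '1 - 50 / 97', '<', '>'
].join('\n');

console.log(parseTableRows(sample, 2));
```

In the script, this would be called as parseTableRows(text, 5) on the string returned by page.evaluate. Reading the cells directly from the DOM would be more robust, but that requires knowing the report's row and cell selectors, which are not shown here.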
