I’m trying to scrape data from a Looker Studio web page report using Puppeteer in Node.js, but I’m encountering issues because the report is dynamic. When I fetch the data, the body is empty. Here’s my code:
import puppeteer from 'puppeteer';

async function fetchData() {
  try {
    const url = 'https://lookerstudio.google.com/u/0/reporting/e36054dd-ffc0-4ef4-b8ab-4d10f7ab4cda/page/wmP0D';
    const options = {
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--disable-gpu'
      ],
      headless: true
    };
    const browser = await puppeteer.launch(options);
    const page = await browser.newPage();
    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36');
    await page.setViewport({width: 1920, height: 1080});
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      if (req.resourceType() === 'stylesheet' || req.resourceType() === 'font' || req.resourceType() === 'image') {
        req.abort();
      } else {
        req.continue();
      }
    });
    await page.goto(url, {waitUntil: 'networkidle0'});
    await page.waitForSelector('.looker-report', { timeout: 60000 });
    const text = await page.evaluate(() => {
      return document.body.innerText;
    });
    console.log(text); // This logs an empty string
    await page.close();
    await browser.close();
  } catch (error) {
    console.error('Error fetching data:', error);
  }
}

fetchData();
The issue I’m facing is that the text is always empty, even though I can see the data when I open the URL in a browser.
How can I modify my Puppeteer script to successfully scrape the dynamically loaded content from this Looker Studio report?
Any help or guidance would be greatly appreciated. Thank you!
I’ve tried:
- Waiting for the ‘.looker-report’ selector
- Using ‘networkidle0’ as the wait condition
- Setting a longer timeout
However, none of these approaches have worked. The page seems to load its content dynamically, and I’m not sure how to capture this data.
What I want to do: the page behind the link contains a table, and I’m trying to fetch the first few rows of that table.
2 Answers
This can happen for a few reasons. First, the timeout may be too short; increasing it gives the page time to load its content completely. Second, rather than waiting for a single element, .looker-report, identify the element that only appears once the data has been fetched and rendered, wait for that, and extract the data only after the wait succeeds. Finally, you can use page.setRequestInterception(true) to observe the requests the page makes, which helps when you need to wait explicitly for a particular operation to complete. Refer to the code below for these modifications:
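A minimal sketch of these modifications, not necessarily the code originally posted with this answer. It assumes a longer timeout of two minutes and waits until an element matching a caller-supplied selector actually contains text before reading the body. The names hasRenderedText and fetchReportText are hypothetical, and the right selector has to be found by inspecting the report in the browser.

```javascript
// Runs inside the page (Puppeteer serializes it): true once at least one
// element matching `selector` contains non-blank text.
function hasRenderedText(selector) {
  return Array.from(document.querySelectorAll(selector))
    .some((el) => (el.textContent || '').trim().length > 0);
}

// Hypothetical helper: open the report, wait for rendered data, return the text.
async function fetchReportText(url, selector, timeoutMs = 120000) {
  // Loaded lazily so the predicate above can be used without a browser.
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });
    // Do not wait for the network to go idle; wait for the data itself.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    await page.waitForFunction(hasRenderedText, { timeout: timeoutMs }, selector);
    return await page.evaluate(() => document.body.innerText);
  } finally {
    // Close the browser even if a wait times out.
    await browser.close();
  }
}
```

It could be called as fetchReportText(url, '.table').then(console.log), where the selector is an assumption to verify against the live page. Stylesheets are not blocked here, on the assumption that the report may need them to lay out its content.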
Hope it helps 🙂
I had a look at the document returned and don’t see a .looker-report element. I dug around and it looks like the table has the table class, so I’m waiting for that.
Output:
Don’t seem to get the whole expected content, but at least getting something!
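The approach in this answer could be sketched roughly as below: wait for the element with the table class, then read the first few rows. The .row and .cell selectors are assumptions about the report's markup, not something confirmed above, and extractRows and fetchFirstRows are hypothetical names.

```javascript
// Runs inside the page: returns up to `limit` rows as arrays of cell text.
function extractRows(tableSelector, rowSelector, cellSelector, limit) {
  const table = document.querySelector(tableSelector);
  if (!table) return [];
  return Array.from(table.querySelectorAll(rowSelector))
    .slice(0, limit)
    .map((row) => Array.from(row.querySelectorAll(cellSelector))
      .map((cell) => (cell.textContent || '').trim()));
}

// Hypothetical helper: open the report and return its first `limit` table rows.
async function fetchFirstRows(url, limit = 5) {
  // Loaded lazily so extractRows can be exercised without a browser.
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Wait for the table element rather than the non-existent .looker-report.
    await page.waitForSelector('.table', { timeout: 60000 });
    // '.row' and '.cell' are assumed selectors; adjust after inspecting the DOM.
    return await page.evaluate(extractRows, '.table', '.row', '.cell', limit);
  } finally {
    await browser.close();
  }
}
```

If the table element appears before its rows are filled in, the result may be partial, which would match the incomplete output described above; waiting for a row selector instead of the table itself is one thing to try.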