I’m trying to get a table from this site using the Cheerio library in Google Apps Script. I’ve put my code below, but I’m getting only [] from console.log().
Here is my code:
function test2() {
  const url = 'https://github.com/labnol/apps-script-starter/blob/master/scopes.md';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  const $ = Cheerio.load(res);
  var data = $('tbody').find('td').toArray().map((x) => { return $(x).text() });
  console.log(data);
}
2 Answers
My way of solving this problem is a bit rough, but here is what I have managed to achieve. I am still a self-taught novice developer.
Here's my code:
The output would be this:
For starters, you might want to use GitHub’s API, avoiding the pitfalls of web scraping.
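As a rough sketch of the API route, the snippet below builds a request for GitHub's repository contents endpoint and asks for the raw file body through the Accept header. The helper names (buildContentsRequest, fetchScopesMarkdown) are hypothetical, and the owner, repo, path and branch are read off the URL in the question. Unauthenticated requests are rate-limited, so treat this as a starting point rather than a finished solution.

```javascript
// Hypothetical helper: build the URL and fetch options for GitHub's
// "get repository content" endpoint. The Accept header requests the raw
// file body instead of the default base64-encoded JSON payload.
function buildContentsRequest(owner, repo, path, ref) {
  const encodedPath = path.split("/").map(encodeURIComponent).join("/");
  return {
    url:
      `https://api.github.com/repos/${owner}/${repo}/contents/${encodedPath}` +
      `?ref=${encodeURIComponent(ref)}`,
    options: {
      headers: { Accept: "application/vnd.github.raw+json" },
      muteHttpExceptions: true,
    },
  };
}

// Apps Script entry point; it only runs inside GAS, where UrlFetchApp exists.
function fetchScopesMarkdown() {
  const { url, options } = buildContentsRequest(
    "labnol", "apps-script-starter", "scopes.md", "master"
  );
  const response = UrlFetchApp.fetch(url, options);
  if (response.getResponseCode() !== 200) {
    throw new Error(`GitHub API returned ${response.getResponseCode()}`);
  }
  return response.getContentText(); // the markdown source of scopes.md
}
```

The returned text is markdown, not HTML, so it still needs to be parsed or converted before Cheerio is of any use.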
If you do want to stick with GAS and avoid the API, the issue seems to be that the page served is different than the one in the browser. I determined this by adding
DriveApp.createFile("test.html", res);
to bypass log truncation (apparently, there is no better way according to TheMaster). From this output HTML, it’s apparent that the data is available only in a React JSON string inside a script tag, which can be extracted with Cheerio, parsed with JSON.parse(), and traversed.
However, an easier option may be to request the raw markdown and either convert it to HTML with marked and proceed with Cheerio, or parse the table by hand. I’ll use the latter option since I’m not too familiar with the GAS package ecosystem:
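The answer's own code is not reproduced in this copy of the thread. A minimal sketch of hand-parsing the table might look like the following. It assumes every table row starts with a pipe character and that no cell contains an escaped pipe. The helper names and the raw.githubusercontent.com URL are assumptions derived from the URL in the question.

```javascript
// Hypothetical helper: turn the pipe-delimited lines of a markdown table
// into an array of rows, each row being an array of trimmed cell strings.
function parseMarkdownTable(markdown) {
  return markdown
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Keep only table lines (assumed to start with a pipe).
    .filter((line) => line.startsWith("|"))
    // Drop the header separator row, e.g. "| --- | :---: |".
    .filter((line) => !/^\|(\s*:?-+:?\s*\|)+$/.test(line))
    .map((line) =>
      line
        .replace(/^\|/, "") // strip the leading pipe
        .replace(/\|$/, "") // strip the trailing pipe
        .split("|")
        .map((cell) => cell.trim())
    );
}

// Apps Script entry point; it only runs inside GAS, where UrlFetchApp exists.
function logScopesTable() {
  // Assumed raw URL for the same file; adjust it if the repo layout differs.
  const url =
    "https://raw.githubusercontent.com/labnol/apps-script-starter/master/scopes.md";
  const markdown = UrlFetchApp.fetch(url).getContentText();
  console.log(parseMarkdownTable(markdown));
}
```

The first row of the result is the table header; empty cells come back as empty strings, which matters for any later grouping step.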
Output:
Parsing the raw markdown is a bit hacky, but should be reliable enough. If it proves not to be, try one of the other options.
If you’re not married to using GAS, your original code works for me in Node 20.11.1:
Output:
Although this works, the array shown above is too flat to be usable: it is essentially one giant row. I would use a nested, row-and-cell-based scrape to preserve the tabular nature of the data and avoid flattening it out.
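A minimal sketch of such a nested scrape is shown here, not necessarily the exact code from the answer. The function name extractRows is hypothetical. It takes the Cheerio instance as a parameter so the same function can be used in Node and in GAS.

```javascript
// Hypothetical helper: return one array per table row, each holding the
// trimmed text of that row's cells, instead of one flat list of cells.
function extractRows($) {
  return $("tbody tr")
    .toArray()
    .map((tr) =>
      $(tr)
        .find("td")
        .toArray()
        .map((td) => $(td).text().trim())
    );
}

// Usage, assuming `html` holds markup that actually contains the table:
//   const $ = Cheerio.load(html);          // GAS library global
//   const $ = require("cheerio").load(html); // Node
//   console.log(extractRows($).slice(0, 5));
```

Because the selector starts at tbody, the header row inside thead is skipped; select "tr" and "th, td" instead if the header is needed.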
Here’s the output, which is similar to the GAS script output (remove the slice calls to see all of the data, without truncation):
You can process this further to group on sub-categories. Rows with two empty cells are delimiters between scope categories (I think; I’m not a domain expert), while rows with an empty right cell are category headers. Here’s an example that groups by sub-categories and attaches headers to each cell:
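The grouping described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the answer's original code: the function name groupByCategory is hypothetical, the input is an array of rows whose first row holds the column headers, and the output shape (category plus items) is my own choice.

```javascript
// Hypothetical helper: group table rows under their category header rows.
// - a row with every cell empty is treated as a delimiter between groups
// - a row with only the left cell filled starts a new category
// - any other row becomes an object keyed by the table's column headers
function groupByCategory(rows) {
  const [headers, ...body] = rows;
  const groups = [];
  let current = null;

  for (const row of body) {
    const left = row[0] || "";
    const right = row[1] || "";

    if (!left && !right) {
      current = null; // delimiter: close the current group
      continue;
    }
    if (left && !right) {
      current = { category: left, items: [] };
      groups.push(current);
      continue;
    }
    if (!current) {
      // Data row that appears before any category header.
      current = { category: "", items: [] };
      groups.push(current);
    }
    const item = {};
    headers.forEach((header, i) => {
      item[header] = row[i] || "";
    });
    current.items.push(item);
  }
  return groups;
}
```

The function is plain JavaScript with no Cheerio or Apps Script dependency, so it accepts rows from either the markdown parse or an HTML scrape.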
I tested this example processing code in both GAS and Node.