
I’m trying to get the table from this site using the Cheerio library in Google Apps Script. I’ve put my code below, but I’m getting only [] from console.log().

Here is my code:

function test2() {
  const url = 'https://github.com/labnol/apps-script-starter/blob/master/scopes.md';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  const $ = Cheerio.load(res);
  var data = $('tbody').find('td').toArray().map((x) => { return $(x).text() });
  console.log(data);
}

I’ve also looked at some existing answers:

  1. one

  2. two

But they didn’t give me any clue on how to get the desired result.

2 Answers


  1. Chosen as BEST ANSWER

    My way of solving this problem is a bit rough, but here is what I managed to achieve. I’m still a self-taught novice developer.

    Here's my code:

    function myFunction() {
      var [c, arr] = [[], []];
      UrlFetchApp.fetch("https://raw.githubusercontent.com/labnol/apps-script-starter/master/scopes.md")
        .getContentText()
        .split("\n") // one entry per markdown line
        .forEach((item) => {
          // blank out markdown punctuation and wrap each line in a one-element array
          c.push(item.replace(/[*`|]/g, " ").trim().split("|||"));
        });
      var v = c.splice(19); // drop the first 19 lines, keep the rest
      v.forEach((item, i) => {
        if (item.length == 2 && item[0] == '' && item[1] == '') {
          v.splice(i, 1);
        }
      });
      v.forEach((item) => {
        if (item[0].indexOf("API") == -1) {
          arr[arr.length - 1].push(item[0]); // scope line: append to the current API
        } else {
          arr.push(item); // a line containing "API" starts a new group
        }
      });
      var g = tesssst(arr);
      console.log(g);
    }

    // Collapse each group into [apiName, "scope1,scope2,..."].
    function tesssst(inputArray) {
      var outputArray = [];
      inputArray.forEach((row) => {
        var newRow = [];
        newRow.push(row[0]);
        newRow.push(row.slice(1).join(','));
        outputArray.push(newRow);
      });
      return outputArray;
    }
    

    The output looks like this:

    [ [ 'Cloud SQL Admin API v1beta4',
        'View and manage your data across Google Cloud Platform services  https://www.googleapis.com/auth/cloud-platform,Manage your Google SQL Service instances  https://www.googleapis.com/auth/sqlservice.admin,' ],
      [ 'Android Management API v1',
        'Manage Android devices and apps for your customers  https://www.googleapis.com/auth/androidmanagement,' ],
      [ 'YouTube Data API v3',
        'Manage your YouTube account  https://www.googleapis.com/auth/youtube,See, edit, and permanently delete your YouTube videos, ratings, comments and captions  https://www.googleapis.com/auth/youtube.force-ssl,View your YouTube account  https://www.googleapis.com/auth/youtube.readonly,Manage your YouTube videos  https://www.googleapis.com/auth/youtube.upload,View and manage your assets and associated content on YouTube  https://www.googleapis.com/auth/youtubepartner,View private information of your YouTube channel relevant during the audit process with a YouTube partner  https://www.googleapis.com/auth/youtubepartner-channel-audit,' ],
      [ 'Cloud Testing API v1',
        'View and manage your data across Google Cloud Platform services  https://www.googleapis.com/auth/cloud-platform,View your data across Google Cloud Platform services  https://www.googleapis.com/auth/cloud-platform.read-only,' ],
      [ 'DoubleClick Search API v2',
        'View and manage your advertising data in DoubleClick Search  https://www.googleapis.com/auth/doubleclicksearch,' ],
      [ 'Tasks API v1',
        'Create, edit, organize, and delete all your tasks  https://www.googleapis.com/auth/tasks,View your tasks  https://www.googleapis.com/auth/tasks.readonly,' ],
      [ 'Calendar API v3',
        'See, edit, share, and permanently delete all the calendars you can access using Google Calendar  https://www.googleapis.com/auth/calendar,View and edit events on all your calendars  https://www.googleapis.com/auth/calendar.events,View events on all your calendars  https://www.googleapis.com/auth/calendar.events.readonly,View your calendars  https://www.googleapis.com/auth/calendar.readonly,View your Calendar settings  https://www.googleapis.com/auth/calendar.settings.readonly,' ],
      [ 'Google Play Custom App Publishing API v1',
        'View and manage your Google Play Developer account  https://www.googleapis.com/auth/androidpublisher,' ],
      [ 'YouTube Analytics API v2',
        'Manage your YouTube account  https://www.googleapis.com/auth/youtube,View your YouTube account  https://www.googleapis.com/auth/youtube.readonly,View and manage your assets and associated content on YouTube  https://www.googleapis.com/auth/youtubepartner,View monetary and non-monetary YouTube Analytics reports for your YouTube content  https://www.googleapis.com/auth/yt-analytics-monetary.readonly,View YouTube Analytics reports for your YouTube content  https://www.googleapis.com/auth/yt-analytics.readonly,' ],
      [ 'Cloud Healthcare API v1alpha2',
        'View and manage your data across Google Cloud Platform services  https://www.googleapis.com/auth/cloud-platform,' ],
      [ 'Cloud Shell API v1',
        'View and manage your data across Google Cloud Platform services  https://www.googleapis.com/auth/cloud-platform,' ],
      [ 'Content API for Shopping v2.1',
        'Manage your product listings and accounts for Google Shopping  https://www.googleapis.com/auth/content,' ]]
    

  2. For starters, you might want to use GitHub’s API, avoiding the pitfalls of web scraping.
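
    As a rough sketch of the API route (for Node 18+, not GAS): the REST "get repository content" endpoint returns JSON whose content field holds the file body base64-encoded. The helper names contentsUrl, decodeContents and fetchScopesMarkdown are made up for illustration, and I haven't run the network call against this repo.

```javascript
// Build the "get repository content" URL for a file on a given branch.
function contentsUrl(owner, repo, path, ref) {
  return `https://api.github.com/repos/${owner}/${repo}/contents/${path}?ref=${encodeURIComponent(ref)}`;
}

// The endpoint's default JSON response carries the file body base64-encoded.
function decodeContents(payload) {
  if (payload.encoding !== "base64") {
    throw new Error(`Unexpected encoding: ${payload.encoding}`);
  }

  return Buffer.from(payload.content, "base64").toString("utf8");
}

// Fetch scopes.md as text. Defined but not called here, so nothing hits the network.
async function fetchScopesMarkdown() {
  const url = contentsUrl("labnol", "apps-script-starter", "scopes.md", "master");
  const res = await fetch(url, {headers: {Accept: "application/vnd.github+json"}});

  if (!res.ok) {
    throw Error(res.statusText);
  }

  return decodeContents(await res.json());
}
```

    Buffer is Node-only. In GAS, UrlFetchApp.fetch() plus Utilities.base64Decode() would be the likely counterparts, but I haven't tested that combination.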

    If you do want to stick with GAS and avoid the API, the issue seems to be that the page served is different than the one in the browser. I determined this by adding DriveApp.createFile("test.html", res); to bypass log truncation (apparently, there is no better way according to TheMaster). From this output HTML, it’s apparent that the data is available only in a React JSON string inside a script tag, which can be extracted with Cheerio, parsed with JSON.parse() and traversed.
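
    A minimal sketch of the script-tag idea, using a plain regex instead of Cheerio so it runs without dependencies. The HTML below is a hypothetical stand-in: the data-target attribute and the payload.blob.richText path are assumptions for illustration only, so check the saved test.html for the real structure before relying on any of these names.

```javascript
// Hypothetical stand-in for the fetched page; the real attribute names
// and payload shape must be checked against the saved test.html.
const html = `
<html><body>
<script type="application/json" data-target="react-app.embeddedData">
{"payload": {"blob": {"richText": "<table><tr><td>Cloud SQL Admin API v1beta4</td></tr></table>"}}}
</script>
</body></html>`;

// Return the parsed contents of the first JSON script tag, or null if none is found.
function extractEmbeddedJson(html) {
  const match = html.match(
    /<script[^>]*type="application\/json"[^>]*>([\s\S]*?)<\/script>/
  );
  return match ? JSON.parse(match[1]) : null;
}

const embedded = extractEmbeddedJson(html);
console.log(embedded.payload.blob.richText);
```

    If the extracted value turns out to be an HTML string like the one in this sample, it could be passed to Cheerio.load() again and queried with the original tbody/td selectors.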

    However, an easier option may be to request the raw markdown and either convert it to HTML with marked and proceed with Cheerio, or parse the table by hand. I’ll use the latter option since I’m not too familiar with the GAS package ecosystem:

    function myFunction() { // default GAS function name
      const url = "https://raw.githubusercontent.com/labnol/apps-script-starter/master/scopes.md";
      const res = UrlFetchApp.fetch(url).getContentText();
      const data = [];
      
      for (const line of res.split("\n")) {
        const chunks = line
          .replace(/[*`]/g, "")
          .split("|")
          .slice(1, 3)
          .filter(e => e !== " -- ")
          .map(e => e.trim());
    
        if (chunks.length) {
          data.push(chunks);
        }
      }
      
      console.log(data);
    }
    

    Output:

    Logging output too large. Truncating output. [
      [ 'Google OAuth API Scope', 'Scope Description' ],
      [ 'Cloud SQL Admin API v1beta4', '' ],
      [ 'View and manage your data across Google Cloud Platform services',
        'https://www.googleapis.com/auth/cloud-platform' ],
      [ 'Manage your Google SQL Service instances',
        'https://www.googleapis.com/auth/sqlservice.admin' ],
      [ '', '' ],
      [ 'Android Management API v1', '' ],
      // ...
    

    Parsing the raw markdown is a bit hacky, but should be reliable enough. If it proves not to be, try one of the other options.
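
    To try the parsing logic without a network call, here's a sketch of the same idea as a pure function. The name parseMarkdownTable and the sample string are made up for illustration. The sample only imitates the shape of scopes.md, and the separator check is slightly more general than the " -- " filter above.

```javascript
// Parse a two-column markdown table into an array of [left, right] rows.
function parseMarkdownTable(markdown) {
  const rows = [];

  for (const line of markdown.split("\n")) {
    if (!line.trim().startsWith("|")) {
      continue; // not a table row
    }

    const cells = line
      .replace(/[*`]/g, "") // strip bold and inline-code markers
      .split("|")
      .slice(1, 3) // the two cells between the pipes
      .map(e => e.trim());

    // Skip the header separator row (cells made only of dashes and colons).
    if (cells.every(e => /^:?-+:?$/.test(e))) {
      continue;
    }

    rows.push(cells);
  }

  return rows;
}

// Made-up sample in the same shape as scopes.md.
const sample = [
  "# Scopes",
  "",
  "| Google OAuth API Scope | Scope Description |",
  "| -- | -- |",
  "| **Cloud SQL Admin API v1beta4** | |",
  "| Manage your Google SQL Service instances | `https://www.googleapis.com/auth/sqlservice.admin` |",
  "| | |",
].join("\n");

console.log(parseMarkdownTable(sample));
```

    Since it takes a string and returns an array, the same function should work unchanged in GAS with the result of getContentText().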


    If you’re not married to using GAS, your original code works for me in Node 20.11.1:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12 or rc.10
    
    const url = "https://github.com/labnol/apps-script-starter/blob/master/scopes.md";
    
    fetch(url)
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
        const data = $("tbody")
          .find("td")
          .toArray()
          .map(x => $(x).text());
        console.log(data);
      })
      .catch(err => console.error(err));
    

    Output:

    [
      'Cloud SQL Admin API v1beta4',
      '',
      'View and manage your data across Google Cloud Platform services',
      'https://www.googleapis.com/auth/cloud-platform',
      'Manage your Google SQL Service instances',
      'https://www.googleapis.com/auth/sqlservice.admin',
      '',
      '',
      'Android Management API v1',
      // ... 1360 total items ...
    ]
    

    Although this works, the array shown above is too flat to be usable: it is essentially one giant row. I would use a nested row- and cell-based scrape to preserve the tabular nature of the data.

    // ...
    const $ = cheerio.load(html);
    const data = [...$("tr")].map(e =>
      [...$(e).find("td, th")].map(e => $(e).text().slice(0, 25))
    );
    console.table(data.slice(0, 10));
    // ...
    

    Here’s the output, which is similar to the GAS script output (remove the slice calls to see all of the data, without truncation):

    ┌─────────┬─────────────────────────────┬─────────────────────────────┐
    │ (index) │ 0                           │ 1                           │
    ├─────────┼─────────────────────────────┼─────────────────────────────┤
    │ 0       │ 'Google OAuth API Scope'    │ 'Scope Description'         │
    │ 1       │ 'Cloud SQL Admin API v1bet' │ ''                          │
    │ 2       │ 'View and manage your data' │ 'https://www.googleapis.co' │
    │ 3       │ 'Manage your Google SQL Se' │ 'https://www.googleapis.co' │
    │ 4       │ ''                          │ ''                          │
    │ 5       │ 'Android Management API v1' │ ''                          │
    │ 6       │ 'Manage Android devices an' │ 'https://www.googleapis.co' │
    │ 7       │ ''                          │ ''                          │
    │ 8       │ 'YouTube Data API v3'       │ ''                          │
    │ 9       │ 'Manage your YouTube accou' │ 'https://www.googleapis.co' │
    └─────────┴─────────────────────────────┴─────────────────────────────┘
    

    You can process this further to group on sub-categories. Rows with two empty cells act as delimiters between scope categories (I think; I’m not a domain expert), while rows with an empty right cell are category headers. Here’s an example that groups by sub-categories and attaches headers to each cell:

    const grouped = [];
    const headers = data[0];
    
    for (const row of data.slice(1)) {
      if (row.every(e => e === "")) {
        continue;
      } else if (row[1] === "") {
        grouped.push({title: row[0], items: []});
      } else {
        grouped
          .at(-1)
          .items.push(
            Object.fromEntries(
              row.map((e, i) => [headers[i], e])
            )
          );
      }
    }
    
    console.log(JSON.stringify(grouped, null, 2));
    

    I tested this example processing code in both GAS and Node.
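
    To make the resulting shape concrete, here's the same grouping wrapped in a function and run on the first few rows from the outputs shown earlier. The names groupRows and sampleRows are made up for illustration, and the grouped.length guard is an addition so a data row that appears before any category header is skipped rather than crashing.

```javascript
// Group flat table rows under their category header rows.
function groupRows(data) {
  const [headers, ...rows] = data;
  const grouped = [];

  for (const row of rows) {
    if (row.every(e => e === "")) {
      continue; // blank delimiter row
    }

    if (row[1] === "") {
      grouped.push({title: row[0], items: []}); // category header
    } else if (grouped.length) {
      // regular row: key each cell by its column header
      grouped[grouped.length - 1].items.push(
        Object.fromEntries(row.map((e, i) => [headers[i], e]))
      );
    }
  }

  return grouped;
}

const sampleRows = [
  ["Google OAuth API Scope", "Scope Description"],
  ["Cloud SQL Admin API v1beta4", ""],
  ["View and manage your data across Google Cloud Platform services", "https://www.googleapis.com/auth/cloud-platform"],
  ["Manage your Google SQL Service instances", "https://www.googleapis.com/auth/sqlservice.admin"],
  ["", ""],
  ["Android Management API v1", ""],
  ["Manage Android devices and apps for your customers", "https://www.googleapis.com/auth/androidmanagement"],
];

const groupedSample = groupRows(sampleRows);
console.log(JSON.stringify(groupedSample, null, 2));
```

    For this sample, the result is two objects, one per API. Each has a title and an items array, and every item is keyed by the two column headers.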
