I am creating a scraper in Node.js and I want it to look for all .css files.

I’m passing the HTML of the page as a string and simply using indexOf() to look for instances of .css, e.g.:

const searchHTMLIndex = htmlString.indexOf(".css");
// indexOf() returns -1 when nothing is found; 0 is a valid match position
if (searchHTMLIndex !== -1) {
          // Count the line breaks before the match to get its line number
          let tempString = htmlString.substring(0, searchHTMLIndex);
          let lineNumber = tempString.split('\n').length;
          jsonObj[getPageId] = pageObj;
          pageObj.pageUrl = url;
          return pageObj.searchTerm[item] = "CSS on line number: " + lineNumber;
}

However, I’d like to get the full CSS file name (and full path) if possible, e.g. /assets/css/myCSSfile.css.

How do I get the preceding characters of a given string (up until, say " or =)?

2 Answers


  1. Use jsdom to parse the HTML:

    https://github.com/jsdom/jsdom

    import {JSDOM} from 'jsdom';
    
    const dom = new JSDOM(htmlString);
    const cssUrls = [...dom.window.document.querySelectorAll('link[rel=stylesheet]')].map(link => link.href);
    
  2. You could use a regexp to extract the href from <link rel="stylesheet" href="URL">:

    const htmlString = `
        <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/Shared/stacks.css?v=312b43e78b51">
        <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/Sites/stackoverflow/primary.css?v=134475a13287">
        <link type="text/css" href="https://cdn.sstatic.net/Shared/Channels/channels.css?v=a4d77abedec3" rel="stylesheet">
    `;
      
    const cssUrls = htmlString.match(/(?<=<link[^>]*(rel="stylesheet")?[^>]+href=")[^"]+(?=([^>]*rel="stylesheet")?)/g);
    console.log(cssUrls);
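One caveat with the regexp above: both `rel="stylesheet"` groups are optional, so it returns the href of every <link> tag (icons, preloads, and so on), not only stylesheets. A rough dependency-free sketch that checks rel per tag, regardless of attribute order, could look like the following. The function name `extractStylesheetUrls`, the `baseUrl` parameter, and the example.com URLs are made up for illustration; the built-in `URL` class is used to turn relative paths into full ones. Regexps remain fragile on unusual markup, so a real parser is still the safer choice.

```javascript
// Hypothetical helper, not taken from either answer.
function extractStylesheetUrls(html, baseUrl) {
  const urls = [];
  // Grab each <link ...> tag as a whole so attribute order does not matter.
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    // rel may hold several space-separated tokens, e.g. "alternate stylesheet".
    const rel = tag.match(/\srel\s*=\s*(["'])(.*?)\1/i);
    if (!rel || !rel[2].toLowerCase().split(/\s+/).includes("stylesheet")) continue;
    // Only quoted href values are handled in this sketch.
    const href = tag.match(/\shref\s*=\s*(["'])(.*?)\1/i);
    if (!href || !href[2]) continue;
    // Resolve relative paths against the page URL when one is supplied.
    urls.push(baseUrl ? new URL(href[2], baseUrl).href : href[2]);
  }
  return urls;
}

// Made-up sample input.
const sampleHtml = `
  <link rel="icon" href="/favicon.ico">
  <link rel="stylesheet" href="/assets/css/myCSSfile.css">
  <link type="text/css" href="https://cdn.example.com/a.css?v=1" rel="stylesheet">
`;
const stylesheetUrls = extractStylesheetUrls(sampleHtml, "https://example.com/page/index.html");
console.log(stylesheetUrls);
```

Passing the page URL as the second argument is what yields full paths; without it, the href values come back exactly as written in the HTML.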
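To answer the question as literally asked (getting the characters that precede a match, up to a quote or =), one possible sketch is to walk backwards from each indexOf() hit until a delimiter is reached. The helper name `findCssPaths` and the delimiter set are assumptions chosen for illustration. This approach knows nothing about HTML, so it also picks up `.css` inside scripts or comments, and it stops at `.css`, dropping any query string that follows.

```javascript
// Hypothetical helper: collects every token that ends in ".css".
function findCssPaths(html) {
  // Characters assumed to mark the start of a path; adjust as needed.
  const delimiters = new Set(['"', "'", "=", " ", "\n", "\t", "(", ">"]);
  const needle = ".css";
  const paths = [];
  let index = html.indexOf(needle);
  while (index !== -1) {
    // Walk backwards until a delimiter or the start of the string.
    let start = index;
    while (start > 0 && !delimiters.has(html[start - 1])) start--;
    paths.push(html.substring(start, index + needle.length));
    // Continue searching after the current match.
    index = html.indexOf(needle, index + 1);
  }
  return paths;
}

// Made-up sample input.
const foundPaths = findCssPaths('<link rel="stylesheet" href="/assets/css/myCSSfile.css">');
console.log(foundPaths);
```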