
I am trying to build a Chrome extension that aggregates information from a bunch of sites when the user visits site A.


// proxyUrl is assumed to be defined elsewhere in the extension (not shown here)
async function fetchHTML(url) {
  const response = await fetch(proxyUrl + url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const html = await response.text();
  console.log(html);
  return html;
}

// Extract the text of the total-violations element from the HTML content.
// Returns null when the element is not present in the fetched markup.
function extractTotalViolations(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, "text/html");
  const element = doc.querySelector(".total-violations");
  return element ? element.textContent : null;
}

// The URL of the page we want to scrape
const url = "https://whoownswhat.justfix.org/en/address/MANHATTAN/610/EAST%2020%20STREET";

// Fetch the HTML content of the page and extract the total violations
fetchHTML(url)
  .then(html => {
    const totalViolations = extractTotalViolations(html);
    console.log(totalViolations);
  })
  .catch(error => console.error(error));

When I print totalViolations, I get null. So I printed the fetched HTML and realized that I am getting some JavaScript code that doesn’t look anything like the HTML I see on the website directly. I suspect the website is using some kind of JavaScript masking, or maybe I am not fetching the HTML correctly.

<script>
!function(e){function t(t){for(var n,l,i=t[0],f=t[1],a=t[2],p=0,s=[];p<i.length;p++)l=i[p],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in f)Object.prototype.hasOwnProperty.call(f,n)&&(e[n]=f[n]);for(c&&c(t);s.length;)s.shift()();return u.push.apply(u,a||[]),r()}function r(){for(var e,t=0;t<u.length;t++){for(var r=u[t],n=!0,i=1;i<r.length;i++){var f=r[i];0!==o[f]&&(n=!1)}n&&(u.splice(t--,1),e=l(l.s=r[0]))}return e}var n={},o={1:0},u=[];function l(t){if(n[t])return n[t].exports;var r=n[t]={i:t,l:!1,exports:{}};return e[t].call(r.exports,r,r.exports,l),r.l=!0,r.exports}l.m=e,l.c=n,l.d=function(e,t,r){l.o(e,t)||Object.defineProperty(e,t,{enumerable:!0,get:r})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",
</script>

My question is: how can I extract the HTML correctly so that I can parse the DOM and get all the information from this site that I want to show in the extension? Thanks.

2 Answers


  1. The fact that you got JavaScript as a response shows that:

    • the request was correct
    • you received a response

    So the problem is not the fetch itself. Load the page with your browser’s Dev Tools open and study the requests that are being sent. Based on your description, it is likely that the first request loads JavaScript code, which then runs and sends further requests to the server. Look closely at those requests: their URLs, request headers, and payloads, as well as the responses.

    You will need to replicate those requests and parse their responses. If a response turns out to be HTML, you can parse it the way you already do; only where and how the request is sent changes. If the response is something else, such as JSON, study the HTML that ends up being displayed on the target site and write code that converts the raw server response into similar HTML.
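
    As a rough illustration of the two cases above, here is a minimal sketch. The endpoint URL you pass in, the `totalViolations` field name, and the helper names `loadViolations`, `violationsToHtml` and `escapeHtml` are all assumptions made for this example; the real URL and response shape have to be read from the Network tab.

    ```javascript
    // Escape a value so it can be placed safely inside HTML text or attributes.
    function escapeHtml(value) {
      return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
    }

    // Convert a JSON payload into a small HTML fragment.
    // "totalViolations" is a hypothetical field name, not the site's documented API.
    function violationsToHtml(data) {
      const total =
        data && data.totalViolations != null ? data.totalViolations : "unknown";
      return `<span class="total-violations">${escapeHtml(total)}</span>`;
    }

    // Replicate one of the page's own requests and branch on the response type.
    // endpointUrl is whatever URL you found in Dev Tools; fetchImpl is injectable
    // so the function can be exercised without a network connection.
    async function loadViolations(endpointUrl, fetchImpl = fetch) {
      const response = await fetchImpl(endpointUrl);
      if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
      }
      const contentType = response.headers.get("content-type") || "";
      if (contentType.includes("application/json")) {
        // JSON: build the HTML yourself from the raw data.
        return violationsToHtml(await response.json());
      }
      // Anything else is returned as text, to be parsed as HTML as before.
      return response.text();
    }
    ```

    In the extension you would call `loadViolations(endpointUrl).then(...)` in place of the original `fetchHTML(url)` call, and only fall back to `DOMParser` when the server really returns HTML.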

  2. You will have to dig a bit deeper into how the page fetches its resources to get what you’re looking for. The URL in question loads its content dynamically, possibly to make scraping inconvenient, but nothing is perfect.

    This URL is requested without any key or credentials and seems to contain the information you’re looking for.

    As others have said, open Dev Tools and use the Network tab to watch how the page loads its resources. That will get you a lot closer to the data you’re looking for.
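
    Once the Network tab shows a request that returns data, the field you want may be nested somewhere inside the response. A small recursive search can help locate it while you explore. This is only a sketch: the helper name `findValueByKey`, the key name, and the sample payload below are made up for illustration and do not reflect the site's actual response.

    ```javascript
    // Depth-first search for the first property whose name matches keyName
    // (case-insensitive). Returns undefined when no such property exists.
    function findValueByKey(node, keyName) {
      if (node === null || typeof node !== "object") return undefined;
      const wanted = keyName.toLowerCase();
      for (const [key, value] of Object.entries(node)) {
        if (key.toLowerCase() === wanted) return value;
        const nested = findValueByKey(value, keyName);
        if (nested !== undefined) return nested;
      }
      return undefined;
    }

    // Made-up sample payload, for illustration only.
    const sample = {
      result: [{ address: "610 EAST 20 STREET", stats: { totalViolations: 7 } }],
    };

    console.log(findValueByKey(sample, "totalviolations")); // 7
    ```

    After you know where the value lives, it is better to read it by its exact path than to keep searching on every request.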
