How can I extract a variable from a script tag in a returned HTML page, using JavaScript/TypeScript?

My API request to the server:
const response = await fetch( ... )

The response contains a large HTML page. Here is a shortened example:

<!DOCTYPE html>
<html lang="de">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Steam App Daten</title>
</head>
<body>

    <h1>Willkommen auf der Seite für Steam App Daten</h1>

    <script type="text/javascript">
        var g_rgAppContextData = {
            "730": {
                "appid": 730,
                "name": "Counter-Strike 2",
                "icon": "https://cdn.fastly.steamstatic.com/steamcommunity/public/images/apps/730/8dbc71957312bbd3baea65848b545be9eae2a355.jpg",
                "link": "https://steamcommunity.com/app/730"
            }
        };
        var g_rgCurrency = [];
    </script>

</body>
</html>

I only want to extract the variable g_rgAppContextData and nothing else. I know that I can select the script tag with getElementsByTagName("script"), but what if there are two script tags? And how do I select only the variable?

2 Answers


  1. I guess what you need is a structure like this: define a global variable and, once the DOM has loaded, fetch the JSON from the server and display some of the data.

    <!DOCTYPE html>
    <html lang="de">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Steam App Daten</title>
        <script type="text/javascript">
          // create the global variable
          var g_rgAppContextData;
          
          // when the DOM has loaded, do stuff
          document.addEventListener('DOMContentLoaded', async e => {
            // fake URL to show what the JSON looks like in the response
            // this would be an ordinary URL in production
            let url = 'data:application/json;base64,eyI3MzAiOiB7CiAgImFwcGlkIjogNzMwLAogICJuYW1lIjogIkNvdW50ZXItU3RyaWtlIDIiLAogICJpY29uIjogImh0dHBzOi8vY2RuLmZhc3RseS5zdGVhbXN0YXRpYy5jb20vc3RlYW1jb21tdW5pdHkvcHVibGljL2ltYWdlcy9hcHBzLzczMC84ZGJjNzE5NTczMTJiYmQzYmFlYTY1ODQ4YjU0NWJlOWVhZTJhMzU1LmpwZyIsCiAgImxpbmsiOiAiaHR0cHM6Ly9zdGVhbWNvbW11bml0eS5jb20vYXBwLzczMCIKICB9Cn0=';
            
            // setting the global variable
            g_rgAppContextData = await fetch(url).then(response => response.json());
            
            // call some function 
            displayinfo();        
          });
          
          function displayinfo(){
            // getting the name from the global variable
            document.querySelector('p').textContent = g_rgAppContextData[730].name;
          }
        </script>
    </head>
    <body>
        <h1>Willkommen auf der Seite für Steam App Daten</h1>
        <p></p>
    </body>
    </html>
  2. Since the pages you want to scrape follow a certain pattern, it seems possible to make a number of simplifying assumptions about the structure of the returned HTML:

    • The desired variable is assigned a constant value in JSON format (in particular, member names like "730" are quoted).
    • The HTML page contains only one assignment for this variable.
    • A semicolon follows immediately after the closing }.
    • The member names and string values do not contain the sequence };.

    Let me know if these assumptions are not justified in your case.

    Under these assumptions, you can extract the variable value with a regular expression and parse it as JSON:

    const response = await fetch("...");
    const html = await response.text();
    const g_rgAppContextData = JSON.parse(
      html.match(/g_rgAppContextData\s*=\s*({.*?});/s)[1]
    );
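
    If the last assumption (no }; inside member names or string values) does not hold, or the variable might be missing from the page, a brace-matching scan may be more forgiving than a single regular expression. The following is only a sketch: extractJsonVariable is a hypothetical helper name, not something from the page or from any library. It still assumes that the value is valid JSON and that the first assignment to the variable is the one you want. Because it works on the whole HTML string, it does not matter how many script tags the page contains.

```javascript
// Hypothetical helper (an illustration, not part of the original answer):
// finds `<varName> = { ... }` in an HTML string and parses the object as JSON.
function extractJsonVariable(html, varName) {
  // Escape regex metacharacters in the variable name (e.g. "$").
  const escaped = varName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

  // Locate the assignment and the position of its opening brace.
  const start = new RegExp(escaped + "\\s*=\\s*\\{").exec(html);
  if (!start) return null; // variable not found
  const open = start.index + start[0].length - 1;

  // Walk forward, counting braces but ignoring those inside JSON strings.
  let depth = 0;
  let inString = false;
  for (let i = open; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      if (ch === "\\") i++; // skip the character after a backslash
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      // Matching closing brace reached: parse the slice (throws if not JSON).
      if (depth === 0) return JSON.parse(html.slice(open, i + 1));
    }
  }
  return null; // braces never balanced
}
```

    With the html string from the snippet above, extractJsonVariable(html, "g_rgAppContextData") would return the parsed object, or null if the assignment is not present.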
    