
I’m creating snapshots of my website pages with PhantomJS, and I would like to remove the script tags from the generated HTML snapshots, but keep those whose type is “application/ld+json” for SEO purposes.

I know how to remove all the script tags (content is the HTML of the snapshot):

content.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, "");

I hope the code above is useful to someone. I would like to know how to change the regex so that it keeps script tags whose type is “application/ld+json”, or how to do this some other way than with a regex.

Example:

<head>........
    <script type="application/ld+json">
        { "@context" : "http://schema.org",
          "@type" : "Organization",
          "name" : "MyOrg",
          "url" : "https://www.myorg.com"
        }
    </script>
....
</head>........

2 Answers


  1. I haven’t actually used PhantomJS before, but it looks like you can manipulate the DOM after retrieving the page using page.evaluate(). Maybe the unwanted script elements could be removed with the DOM API rather than a regex? For example:

    page.evaluate(function() {
        Array.prototype.slice.call(document.getElementsByTagName("script")).filter(function(script) {
            return script.type != "application/ld+json";
        }).forEach(function(script) {
            script.parentNode.removeChild(script);
        });
        return document.documentElement.outerHTML; // or whatever is appropriate
    });
    

    I downloaded PhantomJS and did a quick test, seems to work 🙂 Here’s what I used:

    var fs = require('fs');
    var page = require('webpage').create();
    page.open('...', function(status) {
        if(status === "success") {
            var result = page.evaluate(function() {
                Array.prototype.slice.call(document.getElementsByTagName("script")).filter(function(script) {
                    return script.type != "application/ld+json";
                }).forEach(function(script) {
                    script.parentNode.removeChild(script);
                });
                return document.documentElement.outerHTML;
            });
    
            fs.write("output.html", result, "w");
        }
    
        phantom.exit();
    });
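
    One possible refinement of the filter above, offered as a sketch rather than a tested PhantomJS script: the comparison script.type != "application/ld+json" is exact, so a tag written as type="Application/LD+JSON" would be removed too. The helpers below normalise the value first, assuming that case and surrounding whitespace should be ignored. The names isJsonLd and removeNonJsonLdScripts are hypothetical and not part of PhantomJS.

    ```javascript
    // Hypothetical helper: true when a script element carries the JSON-LD type.
    // It only reads the `type` property, so plain objects work for testing.
    function isJsonLd(script) {
        var type = (script.type || "").trim().toLowerCase();
        return type === "application/ld+json";
    }

    // Hypothetical helper: removes every script for which isJsonLd() is false.
    // `doc` is assumed to behave like a browser `document`.
    function removeNonJsonLdScripts(doc) {
        // Copy the live collection first so removals do not disturb iteration.
        var scripts = Array.prototype.slice.call(doc.getElementsByTagName("script"));
        scripts.forEach(function (script) {
            if (!isJsonLd(script) && script.parentNode) {
                script.parentNode.removeChild(script);
            }
        });
    }
    ```

    Inside page.evaluate() both functions would have to be defined within the callback itself, since that callback runs in the page's context, and the call would be removeNonJsonLdScripts(document).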
    
  2. I’m not very good with regular expressions, but I think you should use a negative lookahead.

    Negative lookahead with ?! peeks ahead to ensure that its subpattern could not possibly match. (http://docs.racket-lang.org/guide/Looking_Ahead_and_Behind.html)

    For example, the regexp first(?!second) matches “first” only if it is not followed by “second”.
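
    To make that concrete, here is a minimal check in plain JavaScript. The variable name pattern and the test strings are only illustrative.

    ```javascript
    // Matches "first" only when it is NOT immediately followed by "second".
    var pattern = /first(?!second)/;

    console.log(pattern.test("firstsecond")); // false
    console.log(pattern.test("firstthird"));  // true
    console.log(pattern.test("first"));       // true

    // The lookahead consumes nothing: only "first" itself is replaced.
    console.log("firstsecond firstthird".replace(/first(?!second)/g, "X"));
    // prints "firstsecond Xthird"
    ```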

    So in your case the regexp would be the following:

    // Remove every <script> element except those with type="application/ld+json".
    // A JavaScript regex literal cannot span lines, so the parts are explained here:
    //   <script\b       the opening tag name
    //   (?![^>]*...)    negative lookahead: skip tags whose attributes include the JSON-LD type
    //   [\s\S]*?        lazily take the attributes and body, newlines included
    //   <\/script>      the closing tag
    content = content.replace(
        /<script\b(?![^>]*type=["']application\/ld\+json["'])[\s\S]*?<\/script>/gi,
        '');
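
    As a quick sanity check of the lookahead idea, here is a sketch you can run in any JavaScript engine. The function name stripNonJsonLdScripts and the sample markup are made up for illustration. Bear in mind that a regex is not an HTML parser, so unusual markup (for instance a literal </script> inside a string, or an attribute value containing >) may defeat it.

    ```javascript
    // Hypothetical helper: strips <script> elements unless their opening tag
    // contains type="application/ld+json" (single or double quotes).
    function stripNonJsonLdScripts(html) {
        var pattern = /<script\b(?![^>]*type=["']application\/ld\+json["'])[\s\S]*?<\/script>/gi;
        return html.replace(pattern, "");
    }

    // Illustrative input: one external script, one JSON-LD block, one inline script.
    var sample =
        '<head>\n' +
        '<script src="app.js"></script>\n' +
        '<script type="application/ld+json">{"@type": "Organization"}</script>\n' +
        '<script>\nvar x = 1;\n</script>\n' +
        '</head>';

    var cleaned = stripNonJsonLdScripts(sample);
    console.log(cleaned);
    ```

    Only the JSON-LD block should remain between the head tags. Scripts with no type attribute at all are removed as well, because the lookahead only rejects tags that do carry the JSON-LD type.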
    

    I know that negative lookahead can be confusing (and I’m a bad teacher), but don’t worry: many people find it hard to understand (me too :) ).

    I hope my answer helps you solve your problem. Good luck!
