PhantomJS remove script tags from html snapshot except "json+ld" - SEO

YahyaYahyaoui
October 14, 2015

I’m creating snapshots of my website pages with PhantomJS, and I would like to remove the script tags from the generated HTML snapshots, except those whose type is “application/ld+json”, which I want to keep for SEO purposes.

I know how to remove all the script tags (content is the HTML of the snapshot):

content.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, "");

I hope the code above will be useful to someone. I would like to know how to change the regex so that it keeps the script tags whose type is “application/ld+json”, or how to do this some way other than with a regex.

Example:

<head>........
    <script type="application/ld+json">
        { "@context" : "http://schema.org",
          "@type" : "Organization",
          "name" : "MyOrg",
          "url" : "https://www.myorg.com"
        }
    </script>
....
</head>........

Answers

I haven’t actually used PhantomJS before, but it looks like you can manipulate the DOM after retrieving the page using page.evaluate(). Maybe the unwanted script elements could be removed with the DOM API rather than a regex? For example:

page.evaluate(function() {
    Array.prototype.slice.call(document.getElementsByTagName("script")).filter(function(script) {
        return script.type != "application/ld+json";
    }).forEach(function(script) {
        script.parentNode.removeChild(script);
    });
    return document.documentElement.outerHTML; // or whatever is appropriate
})

I downloaded PhantomJS and did a quick test, seems to work 🙂 Here’s what I used:

var fs = require('fs');
var page = require('webpage').create();
page.open('...', function(status) {
    if(status === "success") {
        var result = page.evaluate(function() {
            Array.prototype.slice.call(document.getElementsByTagName("script")).filter(function(script) {
                return script.type != "application/ld+json";
            }).forEach(function(script) {
                script.parentNode.removeChild(script);
            });
            return document.documentElement.outerHTML;
        });

        fs.write("output.html", result, "w");
    }

    phantom.exit();
});
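
The filter above relies on an exact match of script.type. A slightly more tolerant sketch follows, assuming type values might differ in case or carry stray whitespace. The names isJsonLdScript and removeOtherScripts are hypothetical, not part of PhantomJS or the DOM API.

```javascript
// Hypothetical helper: true when a script element declares JSON-LD.
function isJsonLdScript(script) {
    var type = (script.type || "").trim().toLowerCase();
    return type === "application/ld+json";
}

// Hypothetical helper: removes every non-JSON-LD script from a document-like object.
function removeOtherScripts(doc) {
    // Copy the collection first, since removing nodes changes a live NodeList.
    var scripts = Array.prototype.slice.call(doc.getElementsByTagName("script"));
    scripts.forEach(function (script) {
        if (!isJsonLdScript(script)) {
            script.parentNode.removeChild(script);
        }
    });
}
```

Because the function passed to page.evaluate() runs inside the page and cannot see variables from the outer PhantomJS script, both helpers would need to be defined inside that callback and then called as removeOtherScripts(document).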

- EugeneKorobov
- October 15, 2015 at 12:29 am
I’m not very good at regular expressions, but I think you should use a negative lookahead.

Negative lookahead with ?! peeks ahead to ensure that its subpattern could not possibly match. (http://docs.racket-lang.org/guide/Looking_Ahead_and_Behind.html)

For example, the regexp first(?!second) matches “first” only if it is not followed by “second”.
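
A quick way to see this behaviour in JavaScript; the test strings are made up for illustration:

```javascript
// Negative lookahead: match "first" only when "second" does not follow it.
var pattern = /first(?!second)/;

var matchesAlone = pattern.test("first");        // nothing follows
var matchesThird = pattern.test("firstthird");   // followed by something else
var matchesSecond = pattern.test("firstsecond"); // followed by "second"

console.log(matchesAlone, matchesThird, matchesSecond); // prints: true true false
```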

So in your case the regexp would look something like this (a JavaScript regex literal cannot span several lines or contain comments, so the pieces are explained above the pattern):
```
content.replace(
    // <script\b      the start of an opening script tag
    // (?![^>]*type="application\/ld\+json")
    //                reject the tag if its type is JSON-LD
    // [^>]*>         the rest of the opening tag
    // [\s\S]*?       the body, lazily, newlines included
    // <\/script>     the closing tag
    /<script\b(?![^>]*type="application\/ld\+json")[^>]*>[\s\S]*?<\/script>/gi,
    ''); // replace every *bad* script block with an empty string
```
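
As a rough check of how a pattern of this kind behaves, here is a small sketch run against a made-up snapshot. The helper name stripNonJsonLdScripts and the sample string are hypothetical, and the pattern assumes the type attribute is quoted and sits inside the opening tag.

```javascript
// Hypothetical helper: strip every <script> block except JSON-LD ones.
function stripNonJsonLdScripts(html) {
    // The lookahead inspects only the opening tag (up to the first ">")
    // and rejects it when type="application/ld+json" is present.
    var pattern = /<script\b(?![^>]*\btype\s*=\s*["']application\/ld\+json["'])[^>]*>[\s\S]*?<\/script>/gi;
    return html.replace(pattern, "");
}

// Made-up sample snapshot with three script tags.
var sample =
    '<head>' +
    '<script src="app.js"></script>' +
    '<script type="application/ld+json">{"@type":"Organization"}</script>' +
    '<script>var x = 1;</script>' +
    '</head>';

var cleaned = stripNonJsonLdScripts(sample);
console.log(cleaned);
// prints: <head><script type="application/ld+json">{"@type":"Organization"}</script></head>
```

Regexes over HTML stay fragile (for example, a ">" inside an attribute value would confuse the opening-tag check), so this is only reasonable for snapshots whose markup you control.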
I know that negative lookahead can be confusing (and I’m a bad teacher), but don’t worry, many people find it hard to understand (me too :) ).

Hope my answer will help you to solve your problems. Good luck!