I’m creating snapshots of my website pages with PhantomJS, and I would like to remove script tags from the generated HTML snapshots, but keep the ones whose type is “application/ld+json” for SEO purposes.
I know how to remove all the script tags (content = the content of the HTML snapshot):
content = content.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, "");
I hope the code above will be useful to someone. I would like to know how to change the regex so that it keeps the script tags whose type is “application/ld+json”, or how to do this some other way than with a regex.
Example:
<head>........
<script type="application/ld+json">
{ "@context" : "http://schema.org",
"@type" : "Organization",
"name" : "MyOrg",
"url" : "https://www.myorg.com"
}
</script>
....
</head>........
2 Answers
I haven’t actually used PhantomJS before, but it looks like you can manipulate the DOM after retrieving the page using page.evaluate(). Maybe removing the appropriate script elements could be done using the DOM API rather than a regex?
I downloaded PhantomJS and did a quick test, and it seems to work 🙂 Here’s what I used:
I’m not very good at regular expressions, but I think you should use a negative lookahead.
For example the regexp
first(?!second)
matches “first” only if it is not followed by “second”. So in your case the regexp would be something like this:
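A sketch of such a pattern, assuming the type attribute is written inside the opening tag and that no attribute value contains a “>” character. The function name stripScriptsExceptJsonLd is made up for illustration. The negative lookahead sits right after `<script` and rejects any opening tag that declares type application/ld+json.

```javascript
// The lookahead idea in isolation: "first" matches only when "second" does not follow.
var lookahead = /first(?!second)/;

// Hypothetical helper: strips <script>...</script> blocks from an HTML string,
// except those whose opening tag has type="application/ld+json".
function stripScriptsExceptJsonLd(content) {
  var pattern = new RegExp(
    '<script\\b' +
    // negative lookahead: give up on this tag if it declares the JSON-LD type
    '(?![^>]*\\btype\\s*=\\s*["\']?application/ld\\+json)' +
    '[^>]*>' +          // rest of the opening tag
    '[\\s\\S]*?' +      // script body, as short as possible
    '</script\\s*>',    // closing tag
    'gi'
  );
  return content.replace(pattern, '');
}

var sample = '<head><script src="a.js"></script>' +
  '<script type="application/ld+json">{"@type":"Organization"}</script>' +
  '<script>var x = 1 < 2;</script></head>';
console.log(stripScriptsExceptJsonLd(sample));
```

With the sample above, only the JSON-LD block should remain inside the head element. Since a regex does not really parse HTML, unusual markup may still slip through.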
I know that negative lookahead may be confusing (and I’m a bad teacher), but don’t worry, many people can’t understand it well (me too : ) ).
Hope my answer will help you to solve your problems. Good luck!