I would like to replace instances of "\n", "\t", and "    " (four spaces) using regex, but preserve all whitespace inside HTML comment blocks. Unfortunately, a comment can contain anything, including other HTML tags, so I have to match "<!--" and "-->" specifically. Furthermore, there may be multiple comments, with whitespace to match in between them. I can use multiple regular expressions if needed, but I cannot modify the HTML content aside from the replacement.
Here is some sample code to experiment with:
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
In this instance, all sets of four spaces should be matched except for the ones in each comment (lines 4, 5, 10, 11, 16, 17).
I have already split up my expressions into one for each type of whitespace, and I have been experimenting with spaces. The closest I have gotten is this:
/(?<!<!--.*?(?<!-->.*?))    (?!(?!.*?<!--).*?-->)/gs
which matches four-space runs outside the first and last comment blocks, but it also matches them inside the middle comment blocks, which is incorrect. However, I suspect it could be fixed by modifying something in the second half:
/    (?!(?!.*?<!--).*?-->)/gs
Any suggestions? Is this even possible?
2 Answers
The DOM Node interface is one approach you can try. Parse your string into an HTML fragment and walk through the nodes, cleaning them up.
If you need to handle more types of nodes, you can add them to the switch; the node type constants are listed in the documentation for Node.nodeType.
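A minimal sketch of this approach follows. It is not the answerer's original code: the function names stripWhitespace and cleanHtml are hypothetical, the replacement is assumed to be an empty string, and cleanHtml assumes a browser environment where document is available. The walker only touches text nodes and skips comment nodes, so everything inside a comment is left alone.

```javascript
// Numeric values of Node.ELEMENT_NODE, Node.TEXT_NODE, Node.COMMENT_NODE
// and Node.DOCUMENT_FRAGMENT_NODE, spelled out so the walker does not
// depend on the global Node object.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const COMMENT_NODE = 8;
const DOCUMENT_FRAGMENT_NODE = 11;

// Hypothetical helper: recursively removes "\n", "\t" and four-space runs
// from text nodes, leaving comment nodes untouched.
function stripWhitespace(node) {
  switch (node.nodeType) {
    case TEXT_NODE:
      node.nodeValue = node.nodeValue.replace(/\n|\t| {4}/g, "");
      break;
    case COMMENT_NODE:
      // Preserve everything inside <!-- ... -->.
      break;
    case ELEMENT_NODE:
    case DOCUMENT_FRAGMENT_NODE:
      for (const child of Array.from(node.childNodes)) {
        stripWhitespace(child);
      }
      break;
    // Add more node types here if you need to handle them.
  }
  return node;
}

// Hypothetical browser-only wrapper: parses the string into a fragment
// via a <template> element, cleans it, and serializes it back.
function cleanHtml(html) {
  const template = document.createElement("template");
  template.innerHTML = html;
  stripWhitespace(template.content);
  return template.innerHTML;
}
```

One caveat with this sketch: serializing the fragment back to a string can normalize the markup (attribute quoting, unclosed tags), and it also strips whitespace inside elements such as pre or script, so it may not fit if the HTML must otherwise stay byte-for-byte identical.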
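Instead of lookarounds, you could match the comments first and keep them as is, and use an alternation to remove every run of four spaces.
To do that, capture the comment in group 1, put the four spaces in the other branch of the alternation, and replace each match with $1. A matched comment is put back unchanged, while a four-space match leaves group 1 empty and is removed.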
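The answer's original pattern is not shown here, so the following is only a sketch of the idea under stated assumptions: the pattern and the function name stripOutsideComments are hypothetical, and the newline and tab branches are added to cover all three kinds of whitespace from the question in one pass.

```javascript
// Group 1 captures a whole comment; the other branches match the
// whitespace to remove. The comment branch comes first, so once the
// regex reaches "<!--" it consumes everything up to the next "-->".
const pattern = /(<!--[\s\S]*?-->)|\n|\t| {4}/g;

// Hypothetical helper: "$1" restores a matched comment unchanged and
// expands to an empty string when one of the whitespace branches matched.
function stripOutsideComments(html) {
  return html.replace(pattern, "$1");
}

const sample =
  "<div>\n    <p>Sample text!</p>\n    <!--\n        <img>\n    -->\n</div>";
console.log(stripOutsideComments(sample));
```

Because the regex scans left to right and never re-enters text it has already consumed, this handles any number of comments without lookbehind, which the lookaround attempts above struggle with.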