How can I match whitespace outside of HTML comments with RegEx?

Jonathan
December 11, 2024

I would like to replace instances of "\n", "\t", and "    " (four spaces) using RegEx, but preserve all whitespace inside HTML comment blocks. Unfortunately, a comment can contain anything, including other HTML tags, so I have to match "<!--" and "-->" specifically. Furthermore, there may be multiple comments, with whitespace to match in between them. I can use multiple regular expressions if needed, but I cannot modify the HTML content aside from the replacement.

Here is some sample code to experiment with:

<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>

In this instance, all sets of four spaces should be matched except for the ones in each comment (lines 4, 5, 10, 11, 16, 17).

I have already split up my expressions into one for each type of whitespace, and I have been experimenting with spaces. The closest I have gotten is this:

/(?<!<!--.*?(?<!-->.*?))    (?!(?!.*?<!--).*?-->)/gs

which matches the indentation (four-space runs) that is not in the first or last comment block, but it also matches indentation inside the middle comment block, which is incorrect. However, I suspect it could be fixed by modifying something in the second half:

/    (?!(?!.*?<!--).*?-->)/gs

Any suggestions? Is this even possible?

Answers

The DOM Node interface is one approach you can try. Parse your string into an HTML fragment and walk through the nodes, cleaning them up as you go.

If you need to handle more node types, you can add them to the switch; refer to the list of Node.nodeType constants.

const str = `<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>`;

const transformNodes = (nodes) => {
  nodes.forEach(node => {
    switch (node.nodeType) {
      case Node.COMMENT_NODE:
        // don't change comments
        break;
      case Node.TEXT_NODE:
        // trim text nodes
        node.nodeValue = node.nodeValue.trim();
        break;
      case Node.ELEMENT_NODE:
        // recurse elements
        transformNodes(node.childNodes);
        break;
    }
  });
};

const el = document.createElement('body');
el.innerHTML = str;
transformNodes(el.childNodes);
console.log(el.innerHTML);
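
Note that trim() only strips whitespace at the start and end of each text node. If the goal is to remove every "\n", "\t", and four-space run inside text nodes, as the question describes, a variant along these lines might work. This is only a sketch: the function name stripTextNodeWhitespace is made up for illustration, and the nodeType values are hard-coded (1 for elements, 3 for text, as defined by the DOM standard) so the function can also be tried on plain objects outside a browser.

```javascript
// nodeType values from the DOM standard, hard-coded (an assumption made
// for portability) so this also runs against plain objects.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical variant of transformNodes: removes "\n", "\t", and runs of
// four spaces from text nodes. Comment nodes (nodeType 8) match neither
// branch, so they are left exactly as they are.
function stripTextNodeWhitespace(nodes) {
  for (const node of nodes) {
    if (node.nodeType === TEXT_NODE) {
      node.nodeValue = node.nodeValue.replace(/\n|\t| {4}/g, '');
    } else if (node.nodeType === ELEMENT_NODE) {
      // Recurse into child nodes of elements.
      stripTextNodeWhitespace(node.childNodes);
    }
  }
}
```

In a browser it would be called the same way as transformNodes above, for example stripTextNodeWhitespace(el.childNodes).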

Instead of lookarounds, you could match the comments first and keep them as they are. Then use an alternation to match each run of four spaces, which gets removed.

For example:

/(<!--.*?-->)|    /gs

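and replace each match with $1. When a comment matched, $1 puts it back unchanged; when four spaces matched, group 1 is empty, so the spaces are removed.

See the test case: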

const text = `<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>`;


const output = text.replace(/(<!--.*?-->)|    /gs, '$1');

console.log(output);
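
The question also mentions "\n" and "\t", which the pattern above does not cover. The same idea can be extended by adding more alternatives. The following is a minimal sketch rather than a tested drop-in solution; the function name stripWhitespaceOutsideComments is hypothetical, and it assumes comments are well-formed (every "<!--" has a closing "-->").

```javascript
// Hypothetical helper: removes "\n", "\t", and runs of four spaces
// everywhere except inside <!-- ... --> comments.
function stripWhitespaceOutsideComments(html) {
  // The first alternative captures a whole comment; the others each match
  // one piece of whitespace to drop. The "s" flag lets "." span newlines.
  const pattern = /(<!--.*?-->)|\n|\t| {4}/gs;
  // If group 1 matched, put the comment back unchanged; otherwise the
  // match was whitespace, so replace it with an empty string.
  return html.replace(pattern, (match, comment) => comment ?? '');
}

const sample = '<div>\n    <p>Sample text!</p>\n    <!--\n        <b>kept</b>\n    -->\n</div>';
console.log(stripWhitespaceOutsideComments(sample));
```

A replacer function is used here instead of the '$1' string only to make the two cases explicit; the behavior for comments is the same as in the answer above.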
