I would like to replace instances of "\n", "\t", and "    " (four spaces) using RegEx, but preserve all whitespace inside HTML comment blocks. Unfortunately, a comment can contain anything, including other HTML tags, so I have to match "<!--" and "-->" specifically. Furthermore, there may be multiple comments, with whitespace to match in between them. I can use multiple regular expressions if needed, but I cannot modify the HTML content aside from the replacement.

Here is some sample code to experiment with:

<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>

In this instance, all sets of four spaces should be matched except for the ones in each comment (lines 4, 5, 10, 11, 16, 17).

I have already split up my expressions into one for each type of whitespace, and I have been experimenting with spaces. The closest I have gotten is this:

/(?<!<!--.*?(?<!-->.*?))    (?!(?!.*?<!--).*?-->)/gs

which correctly skips the four-space runs inside the first and last comment blocks, but still matches the ones inside the middle comment block, which is incorrect. However, I suspect it could be fixed by modifying something in the second half:

/    (?!(?!.*?<!--).*?-->)/gs

Any suggestions? Is this even possible?

2 Answers


  1. The Node interface is one approach you can try. Parse your string into an HTML fragment and walk through the nodes, cleaning them up as you go.

    If you need to handle more types of nodes, you can add them to the switch; refer to the list of node type constants for Node.nodeType.

    const str = `<div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>
    <div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>
    <div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>`;
    
    const transformNodes = (nodes) => {
      nodes.forEach(node => {
        switch (node.nodeType) {
          case Node.COMMENT_NODE:
            // don't change comments
          break;
          case Node.TEXT_NODE:
            // trim text nodes
            node.nodeValue = node.nodeValue.trim();
          break;
          case Node.ELEMENT_NODE:
            // recurse elements
            transformNodes(node.childNodes);
          break;
        }
      });
    };
    
    const el = document.createElement('body');
    el.innerHTML = str;
    transformNodes(el.childNodes);
    console.log(el.innerHTML);
  2. Instead of lookarounds, you could match the comments first and keep them as they are. The other branch of the alternation then matches every run of four spaces so it can be removed.

    For example:

    /(<!--.*?-->)|    /gs
    

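    and replace each match with $1. When a comment matches, $1 puts it back unchanged; when four spaces match, the group did not participate, so $1 is empty and the spaces are removed.

    See the test case: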

    const text = `<div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>
    <div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>
    <div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>`;
    
    
    const output = text.replace(/(<!--.*?-->)|    /gs, '$1');
    
    console.log(output);
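
The same idea might be extended to cover all three kinds of whitespace from the question in a single pass. The following is only a sketch: the helper name `stripWhitespaceOutsideComments` and the shortened sample are assumptions made for illustration and are not part of either answer. It uses a replacer function instead of the `$1` string, which behaves the same way here.

```javascript
// Hypothetical helper (an assumption, not from the answers above): removes
// newlines, tabs, and runs of four spaces everywhere except inside
// <!-- ... --> comments.
function stripWhitespaceOutsideComments(html) {
  // The comment branch comes first, so a comment is consumed as a whole
  // before any whitespace inside it can be matched by the other branches.
  // The "s" flag lets "." match newlines inside multi-line comments.
  const pattern = /(<!--.*?-->)|\n|\t| {4}/gs;
  // For a whitespace match the capture group is undefined, so the match
  // is replaced with an empty string; a comment is returned unchanged.
  return html.replace(pattern, (match, comment) => comment ?? '');
}

// Shortened version of the sample from the question.
const sample = [
  '<div>',
  '    <p>Sample text!</p>',
  '    <!--',
  '        <img src="test.jpg">',
  '    -->',
  '</div>',
].join('\n');

console.log(stripWhitespaceOutsideComments(sample));
```

With this sample, the indentation and line breaks around the tags are removed, while everything between `<!--` and `-->` is left as written. An unterminated `<!--` would not match the comment branch, so whitespace after it would still be removed.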