How can I find and select, in any HTML, only the text (spaces included, but without tabs and line breaks), without selecting the tags themselves?

I managed to do the opposite, but not what is described above.

<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p>   text3   </p>
text2
</body>
</html>

This is what I got:

<[^>]+>(.+?)</[^>]+>

2 Answers


  1. Assuming you wanted

    ["text1", "text2", "text14", "text3", "text2"]
    

    then you can use DOMParser.parseFromString and createNodeIterator like this:

    const htmlStr = `<html>
    <body>
      <h1> text1</h1>
      <p>text2</p>
      text14
      <p> text3 </p>
      text2
    </body>
    </html>`
    const parser = new DOMParser();
    const dom = parser.parseFromString(htmlStr, "text/html");
    
    let currentNode,
      nodeIterator = document.createNodeIterator(dom.documentElement, NodeFilter.SHOW_TEXT);
    
    const textArr = [];
    while (currentNode = nodeIterator.nextNode()) {
      const text = currentNode.textContent.trim();
      if (text !== "") textArr.push(text);
    }
    console.log(textArr);
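
The trim-and-filter step of the loop above can also be tried outside a browser. This is only a minimal sketch: `collectTrimmedText` is a hypothetical helper, and the sample strings are assumed stand-ins for the `textContent` values a parser might yield, not actual parser output.

```javascript
// Hypothetical helper: mirrors the loop body above, but works on plain strings
// instead of text nodes, so it runs without a DOM.
function collectTrimmedText(rawTexts) {
  return rawTexts
    .map((text) => text.trim())      // drop leading/trailing white space
    .filter((text) => text !== "");  // skip nodes that were white space only
}

// Assumed sample values; real text nodes will differ in their exact white space.
const sample = ["\n", " text1", "\n  ", "text2", "\n  text14\n  ", " text3 ", "\n  text2\n"];
console.log(collectTrimmedText(sample));
// → [ 'text1', 'text2', 'text14', 'text3', 'text2' ]
```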
  2. The requirements as in the OP’s own words …

    "how to find and select in any html only text with spaces but without tabs and line breaks and not select the tags themselves"

    The approach needs to combine several techniques, because neither a purely regex-based approach nor one based only on DOMParser and NodeIterator can return the OP’s expected result.

    A NodeIterator with an additional filter that applies two regex-based tests does the job:

    const code =
    `<html>
      <body>
        <h1>foo</h1>  <!-- no pick ... not a single white space at all -->
        <p>  bar </p> <!-- pick... ... simple spaces only -->
        baz           <!-- no pick ... leading tab and new line -->
        <p>bizz</p>   <!-- no pick ... not a single white space at all -->
        buzz          <!-- no pick ... leading simple spaces and new line -->
        <p>booz  </p> <!-- pick... ... simple spaces only -->
      </body>
    </html>`;
    
    const dom = (new DOMParser)
      .parseFromString(code, 'text/html');
    
    const textNodeIterator =
      document.createNodeIterator(
        dom.documentElement,
        NodeFilter.SHOW_TEXT,
        node => (
          (node.textContent.trim() !== '') && // - content other than just white space(s)
          (/\s+/).test(node.textContent) &&    // - content with any kind of white space
          !(/[\t\n]+/).test(node.textContent)  // - content without tabs and new lines
        )
        ? NodeFilter.FILTER_ACCEPT
        : NodeFilter.FILTER_REJECT
      );
    
    const textContentList = [];
    let textNode;
    
    while (textNode = textNodeIterator.nextNode()) {
      textContentList.push(textNode.textContent)
    }
    console.log({ textContentList });
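
The three filter conditions can be checked on plain strings without a DOM. This is a minimal sketch: `hasOnlySimpleSpaces` is a hypothetical name, and the sample strings are assumptions chosen to resemble the text nodes in the markup above.

```javascript
// Hypothetical predicate mirroring the three conditions of the node filter.
function hasOnlySimpleSpaces(text) {
  return (
    text.trim() !== "" &&   // something other than white space is present
    /\s/.test(text) &&      // at least one white space character
    !/[\t\n]/.test(text)    // but no tab and no line feed
  );
}

// Assumed sample values standing in for text node contents.
const samples = ["foo", "  bar ", "\n\tbaz  ", "booz  ", "   "];
console.log(samples.filter(hasOnlySimpleSpaces));
// → [ '  bar ', 'booz  ' ]
```

Like the filter above, this checks only for tabs and line feeds; whether carriage returns should also be excluded depends on the input.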