How can one extract only the text content from HTML code that contains spaces but neither tabs nor line breaks?

dedtis
June 27, 2023
185 views
0 votes
2 Answers

How can I find and select, in any HTML, only the text that contains spaces but no tabs or line breaks, without selecting the tags themselves?

I managed to do the opposite, but not what I described above.

<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p>   text3   </p>
text2
</body>
</html>

This is what I got:

<[^>]+>(.+?)</[^>]+>

Answers

Assuming you wanted

["text1", "text2", "text14", "text3", "text2"]

then you can use parseFromString and createNodeIterator

and do this:

const htmlStr = `<html>
<body>
  <h1> text1</h1>
  <p>text2</p>
  text14
  <p> text3 </p>
  text2
</body>
</html>`
const parser = new DOMParser();
const dom = parser.parseFromString(htmlStr, "text/html");

let currentNode,
  nodeIterator = document.createNodeIterator(dom.documentElement, NodeFilter.SHOW_TEXT);

const textArr = [];
while (currentNode = nodeIterator.nextNode()) {
  const text = currentNode.textContent.trim();
  if (text !== "") textArr.push(text);
}
console.log(textArr);

The requirements as in the OP’s own words …

"how to find and select in any html only text with spaces but without tabs and line breaks and not select the tags themselves"

The approach needs to combine several techniques, because neither a purely regex-based approach nor a plain DOMParser and NodeIterator based one is capable of returning the OP’s expected result.

But a NodeIterator instance created with an additional filter function, where the filter applies two regex-based tests, does the job …
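To illustrate the regex half of that claim, here is a minimal sketch that runs the pattern from the question against the question's own markup. The variable names `html`, `pattern` and `captured` are made up for this illustration, and the `g` flag is an assumption added so that every match is collected.

```javascript
// Sample markup from the question.
const html = `<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p>   text3   </p>
text2
</body>
</html>`;

// The OP's pattern; the global flag is added here (an assumption)
// so that matchAll can return every match instead of only the first.
const pattern = /<[^>]+>(.+?)<\/[^>]+>/g;

// Keep only the first capture group of each match.
const captured = [...html.matchAll(pattern)].map(match => match[1]);

console.log(captured); // [" text1", "text2", "   text3   "]
```

The bare text nodes `text14` and `text2` never sit between an opening and a closing tag on one line, so the pattern skips them, and nothing in it distinguishes spaces from tabs or line breaks.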

const code =
`<html>
  <body>
    <h1>foo</h1>  <!-- no pick ... not a single white space at all -->
    <p>  bar </p> <!-- pick... ... simple spaces only -->
    baz           <!-- no pick ... leading tab and new line -->
    <p>bizz</p>   <!-- no pick ... not a single white space at all -->
    buzz          <!-- no pick ... leading simple spaces and new line -->
    <p>booz  </p> <!-- pick... ... simple spaces only -->
  </body>
</html>`;

const dom = (new DOMParser)
  .parseFromString(code, 'text/html');

const textNodeIterator =
  document.createNodeIterator(
    dom.documentElement,
    NodeFilter.SHOW_TEXT,
    node => (
      (node.textContent.trim() !== '') && // - content other than just white space(s)
      (/\s+/).test(node.textContent) &&   // - content with any kind of white space
      !(/[\t\n]+/).test(node.textContent) // - content without tabs and new lines
    )
    ? NodeFilter.FILTER_ACCEPT
    : NodeFilter.FILTER_REJECT
  );

const textContentList = [];
let textNode;

while (textNode = textNodeIterator.nextNode()) {
  textContentList.push(textNode.textContent)
}
console.log({ textContentList });
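The three checks described in the filter's comments can also be tried without a DOM, as a plain predicate on strings. The following is only a sketch: the helper name `isSpacesOnlyText` and the `samples` array are hypothetical, and the sample strings are an assumption about what the text nodes of the markup above look like.

```javascript
// Hypothetical helper: the filter's three checks applied to a plain string.
function isSpacesOnlyText(text) {
  return (
    text.trim() !== '' &&  // something other than white space is present
    /\s/.test(text) &&     // at least one white space character is present
    !/[\t\n]/.test(text)   // but no tab and no new line
  );
}

// Assumed text node values, modelled on the markup above.
const samples = [
  'foo',                   // no white space at all
  '  bar ',                // simple spaces only
  '\n    baz           ',  // contains a new line
  'booz  ',                // simple spaces only
  '   ',                   // white space only
];

console.log(samples.filter(isSpacesOnlyText)); // ["  bar ", "booz  "]
```

Keeping the test in a separate function like this makes it easy to exercise in Node.js, where `DOMParser` is not available by default, before passing it on as the iterator's filter in the browser.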

