Html - Xpath: Extract text between tags, but stop as soon as an embedded tag occurs

RicoSonntag
April 2, 2024
175 views
0 votes
2 Answers

I would like to extract the text inside the following HTML, ignoring everything inside a particular nested element and everything that follows it.

The HTML appears in different forms.

<span class="classA">Text 1 <span class="classB">Text 2</span> Text 3 <span class="classC">Text 4</span> Text 5</span>

Desired result: "Text 1 Text 2 Text 3"

Other variants:

<span class="classA">Text 1 <span class="classC">Text 2</span></span>
<span class="classA">Text 1 <span class="classC">Text 2</span> Text 3</span>
<span class="classA">Text 1</span>

Desired result: "Text 1"

<span class="classA">Text 1 <span class="classB">Text 2</span> Text 3</span>

Desired result: "Text 1 Text 2 Text 3"

So everything after the occurrence of a span element with class "classC" should be ignored. It’s also possible that "classC" doesn’t appear at all.

I already tried //span[@class="classA"]//text()[parent::*[not(@class="classC")]]. This ignores the "classC" content, but still returns the text after <span class="classC"> (Text 5 in the first example).

How can I achieve this?

Update:

With //span[@class="classC"]//parent::*/preceding::text() I'm getting a little closer. However, it still doesn't work with <span class="classA">Text 1</span>, which returns nothing.

Answers

<?php

$html = '<span class="classA">Text 1 <span class="classB">Text 2</span> Text 3 <span class="classC">Text 4</span> Text 5</span>';

// Create a DOMDocument object
$dom = new DOMDocument();

// Load the HTML string
$dom->loadHTML($html);

// Create a DOMXPath object
$xpath = new DOMXPath($dom);

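// XPath query: text nodes inside the "classA" span that are neither inside a
// "classC" span nor located after one (assumes a single "classA" span in the document)
$query = '//span[@class="classA"]//text()[not(ancestor::span[@class="classC"]) and not(preceding::span[@class="classC"])]';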

// Execute the XPath query
$textNodes = $xpath->query($query);

// Initialize an array to store the extracted text
$textArray = [];

// Iterate through the text nodes and extract the text content
foreach ($textNodes as $node) {
    // Trim the whitespace and add the text content to the array
    $textArray[] = trim($node->nodeValue);
}

// Join the extracted text into a single string
$extractedText = implode(' ', $textArray);

// Output the extracted text
echo $extractedText;

?>

- MichaelKay
- April 2, 2024 at 1:06 pm
- 0 votes
You haven’t said which XPath version you are using. This is quite hard to achieve using XPath 1.0 which is all that PHP’s DOMXPath supports.

Logically you can do
```
.//text() except .//span[@class="classC"]/following::text()
```
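but the except operator requires XPath 2.0. In XPath 1.0 you can work around this by rewriting (A except B) as A[count(.|B)!=count(B)], though that is potentially very inefficient. Note that the following axis does not include descendants, so the text inside the classC span itself has to be excluded separately.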
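
For PHP's DOMXPath, the count() rewrite could be applied roughly as follows. This is only a sketch: extractBeforeClassC() is a hypothetical helper name, the classC span is looked up via the text node's classA ancestor so that several classA spans in one document stay independent, and the set being subtracted also contains the text inside the classC span.

```php
<?php
// Sketch only: emulates "A except B" in XPath 1.0 as A[count(.|B) != count(B)].
// extractBeforeClassC() is a hypothetical helper, not part of any library.
function extractBeforeClassC(string $html): string
{
    $dom = new DOMDocument();
    // Suppress parser notices; the HTML parser wraps the fragment in <html><body>
    $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    $xpath = new DOMXPath($dom);

    // classC spans that live in the same classA span as the current text node
    $c = "ancestor::span[@class='classA']//span[@class='classC']";
    // B: text inside those classC spans plus all text after them
    $b = "$c//text() | $c/following::text()";
    // A: every text node inside a classA span; keep only nodes not in B
    $query = "//span[@class='classA']//text()[count(. | $b) != count($b)]";

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

echo extractBeforeClassC(
    '<span class="classA">Text 1 <span class="classB">Text 2</span> Text 3 '
    . '<span class="classC">Text 4</span> Text 5</span>'
), PHP_EOL;
// prints "Text 1 Text 2 Text 3"
```

Because B is re-evaluated for every candidate text node, the cost grows quickly on large documents, which is the inefficiency mentioned above.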