XPath Child Contents - PHP

GFL
April 26, 2023

I’m trying to get the contents of the paragraph in the following HTML:

 <h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>

There are several h4s, but only one with the class synopsis.

I’m able to get the h4 element with print_r($xpath->query("//h4[contains(@class, 'synopsis')]")); but I’m unable to get the child paragraph contents.

What am I doing wrong?

Answers

- kjhughes
- April 26, 2023 at 5:09 am
If
```
//h4[contains(@class, 'synopsis')]
```
selects the desired h4 elements, then
```
//h4[contains(@class, 'synopsis')]/p
```
will select the child p elements of those h4 elements, and
```
//h4[contains(@class, 'synopsis')]/p/text()
```
will select the text node children of those p elements.

You can obtain the string value of a node via string():
```
string(//h4[contains(@class, 'synopsis')]/p)
```
Note that the above assumes XPath 1.0 (or that there is only one such p): in XPath 1.0, string() applied to a node set returns the string value of the first node in document order. Passing a sequence of more than one node to string() is an error in XPath 2.0 and higher, where you should instead use:
```
string((//h4[contains(@class, 'synopsis')]/p)[1])
```
if there could be more than one such p, or
```
//h4[contains(@class, 'synopsis')]/p/string()
```
if you’d like the string values of all such p elements returned.
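Since the question is about PHP, here is a minimal sketch of how these expressions might be run through DOMXPath. It assumes the markup is loaded as XML with loadXML(), so that the parser does not rearrange the invalid h4/p nesting; the sample markup and variable names are made up for illustration. PHP's DOMXPath implements XPath 1.0, so the /p/string() form is not available there.

```php
<?php
// Hypothetical sample markup. Loading it as XML (an assumption of this
// sketch) keeps <p> nested inside <h4> exactly as written.
$markup = '<div>'
    . '<h4 class="m-b-0 text-dark">Another heading</h4>'
    . '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>'
    . '</div>';

$document = new DOMDocument();
$document->loadXML($markup);
$xpath = new DOMXPath($document);

// query() returns a DOMNodeList holding the child p elements.
$paragraphs = $xpath->query("//h4[contains(@class, 'synopsis')]/p");
$firstParagraphText = $paragraphs->item(0)->textContent;

// evaluate() can return a scalar, so string() yields the text directly.
$synopsis = $xpath->evaluate("string(//h4[contains(@class, 'synopsis')]/p)");

// XPath positions start at 1, so [1] picks the first matching p.
$firstOnly = $xpath->evaluate("string((//h4[contains(@class, 'synopsis')]/p)[1])");

echo $synopsis, PHP_EOL;
```

With query(), the text has to be read from the returned nodes (for example via textContent); with evaluate() and string(), the expression itself produces the string.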

Example HTML
```
<!doctype html>
<html>
<head>
  <title>p is not allowed in h4...</title>
</head>
<body>
  <h4><p>...but can still be selected via XPath</p></h4>
</body>
</html>
```
XPath selection example (using $x() in the browser's developer console)
```
$x("//h4/p")
[p]
$x("string(//h4/p)")
'...but can still be selected via XPath'
```

- ThW
- April 27, 2023 at 10:35 am
An h4 cannot contain a p. PHP's DOMDocument will try to fix the HTML:
```
$document = new DOMDocument();
$document->loadHTML(
  '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>',
);
echo $document->saveHTML();
```
```
Warning: DOMDocument::loadHTML(): Unexpected end tag : h4 in Entity, line: 1 in /in/BTVAZ on line 4
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><h4 class="m-b-0 text-dark synopsis"></h4><p>This is the text I want.</p></body></html>
```
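This can mostly be avoided with some loading flags. LIBXML_HTML_NOIMPLIED stops the parser from adding implied html/body elements, LIBXML_HTML_NODEFDTD suppresses the default doctype, and LIBXML_NOERROR silences the error reports: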
```
$document = new DOMDocument();
$document->loadHTML(
  '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>',
  LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR
);
echo $document->saveHTML();
```
```
<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>
```
The class attribute value consists of tokens separated by whitespace. A plain contains() also matches when the string is merely part of another class name.

To match class tokens with XPath 1.0, use normalize-space() and concat(). The idea is to convert the attribute value to {space}classOne{space}classTwo{space} and match it against {space}classOne{space}.
- Replace all whitespace sequences with a single space and trim the value:
  normalize-space(@class)
- Add a space at the start and end:
  concat(' ', normalize-space(@class), ' ')
- Look for the class name surrounded by spaces:
  [contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]
- Match any element node with the class:
  //*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]
- Cast the first matching node to a string:
  string(//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')])
```
$document = new DOMDocument();
$document->loadHTML(
  '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>',
  LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR
);
$xpath = new DOMXPath($document);

$expression = "string(//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')])";

var_dump($xpath->evaluate($expression));
```
Output:
```
string(24) "This is the text I want."
```
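To illustrate why the token-based test matters, the following sketch compares it with a plain contains() on markup where another class name merely starts with "synopsis". The class name synopsis-extra and the sample markup are invented for this example, and the markup is loaded as XML (an assumption) so that no HTML repair gets in the way.

```php
<?php
// Hypothetical markup: "synopsis-extra" is an invented class name that
// contains the substring "synopsis" without being the synopsis class.
$markup = '<div>'
    . '<h4 class="synopsis-extra">Not wanted</h4>'
    . '<h4 class="m-b-0  text-dark synopsis">Wanted</h4>'
    . '</div>';

$document = new DOMDocument();
$document->loadXML($markup);
$xpath = new DOMXPath($document);

// Substring test: matches both h4 elements.
$loose = $xpath->query("//h4[contains(@class, 'synopsis')]");

// Token test: matches only the element whose class list has the exact token.
$strict = $xpath->query(
    "//h4[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]"
);

echo $loose->length, PHP_EOL;  // 2
echo $strict->length, PHP_EOL; // 1
echo $strict->item(0)->textContent, PHP_EOL; // Wanted
```

The double space in the second class attribute is deliberate: normalize-space() collapses it, so the token test still matches.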
To fetch multiple nodes, remove the string() cast from the XPath expression. It will then return a node list. Iterate over the nodes and read each node's textContent property, which contains the text of all descendant text nodes.
```
$expression = "//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]";

foreach ($xpath->evaluate($expression) as $synopsis) {
    var_dump($synopsis->textContent);
}
```