
I’m trying to get the contents of the paragraph in the following HTML:

 <h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>

There are several h4 elements, but only one with the class synopsis.

I’m able to get the h4 element with print_r($xpath->query("//h4[contains(@class, 'synopsis')]")); but I’m unable to get the child paragraph contents.

What am I doing wrong?

2 Answers


  1. If

    //h4[contains(@class, 'synopsis')]
    

    selects the desired h4 elements, then

    //h4[contains(@class, 'synopsis')]/p
    

    will select the child p elements of those h4 elements, and

    //h4[contains(@class, 'synopsis')]/p/text()
    

    will select the text node children of those p elements.

    You can obtain the string value of a node via string():

    string(//h4[contains(@class, 'synopsis')]/p)
    

    Note that the above assumes XPath 1.0 (or that there will only be one such p). In XPath 1.0, string() returns the string value of the first node, in document order, of the selected node set. Passing a sequence of more than one node to string() is an error in XPath 2.0 and higher, where you should instead use:

    string((//h4[contains(@class, 'synopsis')]/p)[1])
    

    if there could be more than one such p, or

    //h4[contains(@class, 'synopsis')]/p/string()
    

    if you’d like the string values of all such p elements returned.
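
    Since the question is about PHP, here is a minimal sketch of how the XPath 1.0 expressions above could be run through DOMXPath, which implements XPath 1.0 only (so the /string() and sequence forms are not available there). The wrapping div and the extra h4 are made up for illustration, and the fragment is loaded with loadXML() as an assumption, purely so the p stays nested inside the h4 exactly as written.

    ```php
    <?php
    // Assumption: parse the fragment as XML so <p> remains a child of <h4>.
    // The <div> wrapper and the first <h4> are hypothetical filler.
    $document = new DOMDocument();
    $document->loadXML(
        '<div><h4 class="title">Other heading</h4>'
        . '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4></div>'
    );
    $xpath = new DOMXPath($document);

    // query() returns a DOMNodeList; read textContent from each node
    // instead of print_r()-ing the list itself.
    $paragraphs = $xpath->query("//h4[contains(@class, 'synopsis')]/p");
    foreach ($paragraphs as $p) {
        echo $p->textContent, PHP_EOL;
    }

    // evaluate() can return scalars, so string() gives a plain PHP string.
    $text = $xpath->evaluate("string(//h4[contains(@class, 'synopsis')]/p)");
    echo $text, PHP_EOL;
    ```

    query() is enough when you want the nodes; evaluate() is needed for expressions such as string() that do not return a node set.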


    Example HTML

    <!doctype html>
    <html>
    <head>
      <title>p is not allowed in h4...</title>
    </head>
    <body>
      <h4><p>...but can still be selected via XPath</p></h4>
    </body>
    </html>
    

    XPath selection example (browser developer console)

    $x("//h4/p")
    [p]
    $x("string(//h4/p)")
    '...but can still be selected via XPath'
    
  2. h4 cannot contain p. PHP's DOMDocument will try to fix the HTML:

    $document = new DOMDocument();
    $document->loadHTML(
      '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>',
    );
    echo $document->saveHTML();
    
    Warning: DOMDocument::loadHTML(): Unexpected end tag : h4 in Entity, line: 1 in /in/BTVAZ on line 4
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><h4 class="m-b-0 text-dark synopsis"></h4><p>This is the text I want.</p></body></html>
    

    This can be mostly avoided with some loading flags:

    $document = new DOMDocument();
    $document->loadHTML(
      '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>',
      LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR
    );
    echo $document->saveHTML();
    
    <h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>
    

    The class attribute value consists of tokens separated by whitespace. A simple contains() will also match when the string is only part of another class name.

    To match them with XPath 1.0, use normalize-space() and concat(). The idea is to convert the attribute value to {space}classOne{space}classTwo{space} and match it against {space}classOne{space}.

    • Replace all whitespace sequences with a single space and trim the value:
      normalize-space(@class)
    • Add a space at the start and end:
      concat(' ', normalize-space(@class), ' ')
    • Look for the class name surrounded by spaces:
      [contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]
    • Match any element node with the class:
      //*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]
    • Cast the first node to a string:
      string(//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')])

    $document = new DOMDocument();
    $document->loadHTML(
      '<h4 class="m-b-0 text-dark synopsis"><p>This is the text I want.</p></h4>',
      LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR
    );
    $xpath = new DOMXPath($document);
    
    $expression = "string(//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')])";
    
    var_dump($xpath->evaluate($expression));
    

    Output:

    string(24) "This is the text I want."
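
    To see why the token match matters, here is a small sketch comparing the two predicates. The markup and the class name synopsis-short are hypothetical, chosen only because that name contains "synopsis" as a substring.

    ```php
    <?php
    // Hypothetical markup: "synopsis-short" only contains the substring "synopsis".
    $html = '<div><p class="synopsis-short">Teaser</p>'
          . '<p class="text-dark  synopsis">Full synopsis</p></div>';

    $document = new DOMDocument();
    $document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($document);

    // Substring test on the raw attribute value.
    $loose = $xpath->query("//*[contains(@class, 'synopsis')]");

    // Whole-token test on the normalized, space-padded attribute value.
    $exact = $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]"
    );

    echo $loose->length, PHP_EOL; // counts both <p> elements
    echo $exact->length, PHP_EOL; // counts only the element with the "synopsis" token
    ```

    The doubled space inside the second class attribute is deliberate: normalize-space() collapses it, so the token is still found.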
    

    To fetch multiple nodes, remove the string() cast from the XPath expression. The expression will then return a node list. Iterate over the nodes and read each node's textContent property, which contains the concatenated contents of all descendant text nodes.

    $expression = "//*[contains(concat(' ', normalize-space(@class), ' '), ' synopsis ')]";
    
    foreach ($xpath->evaluate($expression) as $synopsis) {
        var_dump($synopsis->textContent);
    }
    