
I’m trying to access all the text within the Text node of the following XML document:

<Section>
 <Subsection lims:inforce-start-date="2003-07-01" lims:fid="182941" lims:id="182941">
  <Label>(2)</Label>
  <Text>
   In subsection (1),
    <DefinedTermEn>beer</DefinedTermEn>
   and
    <DefinedTermEn>malt liquor</DefinedTermEn>
   have the meaning assigned by section 4.
   </Text>
 </Subsection>
</Section>

With XPath, calling $xml->xpath("Body/Section/Subsection") returns the following:

object(SimpleXMLElement)#7 (3) {
    
    ["Label"]=>
    string(3) "(2)"
    ["Text"]=>
    string(64) "In subsection (1),  and  have the meaning assigned by section 4."

This makes the inner nodes disappear. Is there a way to "flatten" the content of all the subnodes within a node so that I get one continuous piece of text?
e.g. In subsection (1), beer and malt liquor have the meaning assigned by section 4.

2 Answers


  1. Mixed-content nodes are too complex for SimpleXML; use DOM instead. The DOMNode::$textContent property returns the text content of any node, and for element nodes this includes the text content of all descendant nodes. DOMXPath::evaluate() also supports expressions that return scalar values. Casting a node list to a string inside an XPath expression, as string(Text) does, returns the text content of the first node in the list.

    // bootstrap DOM
    $document = new DOMDocument();
    $document->loadXML(getXML());
    $xpath = new DOMXPath($document);
    
    // iterate the subsection element nodes
    foreach ($xpath->evaluate('//Subsection') as $subsection) {
        var_dump(
            [
                // text content of the "Label" child element
                'label' => $xpath->evaluate('string(Label)', $subsection),
                // text content of the "Text" child element
                'text' => $xpath->evaluate('string(Text)', $subsection),
            ]
        );
    }
    
    function getXML() {
      return <<<'XML'
    <Section xmlns:lims="urn:lims">
     <Subsection lims:inforce-start-date="2003-07-01" lims:fid="182941" lims:id="182941">
      <Label>(2)</Label>
      <Text>
       In subsection (1),
        <DefinedTermEn>beer</DefinedTermEn>
       and
        <DefinedTermEn>malt liquor</DefinedTermEn>
       have the meaning assigned by section 4.
       </Text>
     </Subsection>
    </Section>
    XML;
    }
    

    Output:

    array(2) {
      ["label"]=>
      string(3) "(2)"
      ["text"]=>
      string(101) "
       In subsection (1),
        beer
       and
        malt liquor
       have the meaning assigned by section 4.
       "
    }
    
  2. The answer @ThW posted explains why DOM is a better fit for this. However, that approach may leave you with a whitespace problem: the result keeps the line breaks and indentation of the source document. One option is to write a function that recurses through the node tree within your Text element and builds a string, trimming the whitespace from each text node so that you end up with a single line.

    <?php
    
    $input = <<<END
    <Body xmlns:lims="urn:lims">
    <Section>
     <Subsection lims:inforce-start-date="2003-07-01" lims:fid="182941" lims:id="182941">
      <Label>(2)</Label>
      <Text>
       In subsection (1),
        <DefinedTermEn>beer</DefinedTermEn>
       and
        <DefinedTermEn>malt liquor</DefinedTermEn>
       have the meaning assigned by section 4.
       </Text>
     </Subsection>
    </Section>
    </Body>
    END;
    
    // Create DOMDocument instance and load XML
    $dom = new DOMDocument();
    $dom->loadXML($input);
    
    // Instantiate XPath with our document
    $xpath = new DOMXPath($dom);
    
    // Get the Text elements
    $textElements = $xpath->query("/Body/Section/Subsection/Text");
    
    /**
     * Recursive function to build a string from the text content within
     * a DOMNode and its children. Whitespace is trimmed.
     *
     * @param DOMNode $node
     * @return string
     */
    function getBranchText(DOMNode $node): string
    {
        // Text and CDATA nodes contribute their trimmed content
        if ($node->nodeType === XML_TEXT_NODE || $node->nodeType === XML_CDATA_SECTION_NODE)
        {
            return trim($node->nodeValue);
        }

        $buffer = [];

        if ($node->nodeType === XML_ELEMENT_NODE)
        {
            foreach ($node->childNodes as $currChild)
            {
                $text = getBranchText($currChild);

                // Skip whitespace-only nodes so they do not add extra spaces
                if ($text !== '')
                {
                    $buffer[] = $text;
                }
            }
        }

        return implode(' ', $buffer);
    }
    
    $output = getBranchText($textElements[0]);
    
    echo $output.PHP_EOL;
    

    Output:

    In subsection (1), beer and malt liquor have the meaning assigned by section 4.
    
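The whitespace problem mentioned in the second answer can probably also be handled inside XPath itself. XPath 1.0 defines normalize-space(), which strips leading and trailing whitespace and collapses internal runs of whitespace into single spaces, so it can stand in for string() in the first answer's approach. The following is a minimal sketch, not a definitive solution: the variable names are illustrative, and the urn:lims namespace URI is a placeholder borrowed from the first answer, not the real namespace of the source document.

```php
<?php

// Sample document from the question; "urn:lims" is a placeholder namespace URI
$xml = <<<'XML'
<Section xmlns:lims="urn:lims">
 <Subsection lims:inforce-start-date="2003-07-01" lims:fid="182941" lims:id="182941">
  <Label>(2)</Label>
  <Text>
   In subsection (1),
    <DefinedTermEn>beer</DefinedTermEn>
   and
    <DefinedTermEn>malt liquor</DefinedTermEn>
   have the meaning assigned by section 4.
   </Text>
 </Subsection>
</Section>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Collect the flattened, whitespace-normalized text of each Subsection
$texts = [];
foreach ($xpath->evaluate('//Subsection') as $subsection) {
    // normalize-space() converts the Text child to a string, trims it,
    // and collapses every run of whitespace into a single space
    $texts[] = $xpath->evaluate('normalize-space(Text)', $subsection);
}

echo $texts[0], PHP_EOL;
// prints: In subsection (1), beer and malt liquor have the meaning assigned by section 4.
```

One difference from the recursive function: normalize-space() only collapses whitespace that already exists in the document, so adjacent elements with no whitespace between them would be joined without a space.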