Use PHP regex to search and create an array from an XML document only containing top level tags with existing order

EnginYilmaz
May 2, 2023
189 views
0 votes
2 Answers

I have a PHP variable holding an XML document as a string. I need to put its top-level elements, e.g. <w:p> and <w:tbl> (paragraphs and tables), into an array in their existing order, without their contents.

A sample array with the expected result:

'a paragraph'
'a table'
'a paragraph'

The PHP code I have so far:

<?php
$text = <<<EOT
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
  <w:body>
    <w:p w:rsidR="00FC1847" w:rsidRDefault="00A526BC">
      <w:r>
        <w:t>foo</w:t>
      </w:r>
      <w:r w:rsidR="007C7582">
        <w:t>0</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="TabloKlavuzu"/>
        <w:tblW w:w="0" w:type="auto"/>
        <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
      </w:tblPr>
      <w:tblGrid>
        <w:gridCol w:w="11329"/>
      </w:tblGrid>
      <w:tr w:rsidR="00A526BC" w:rsidTr="00A526BC">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="11329" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="00A526BC" w:rsidRDefault="00A526BC">
            <w:r>
              <w:t>bar</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
    </w:tbl>
    <w:p w:rsidR="00A526BC" w:rsidRDefault="00A526BC">
      <w:r>
        <w:t>baz</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="00A526BC" w:rsidSect="00A526BC">
      <w:pgSz w:w="11907" w:h="16839" w:code="9"/>
      <w:pgMar w:top="459" w:right="284" w:bottom="1418" w:left="284" w:header="709" w:footer="709" w:gutter="0"/>
      <w:cols w:space="708"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
EOT;
preg_match_all('%<w:p .*?>(.*?<w:r>.*?</w:r>).*?</w:p>%si', $text, $matches);
print_r($matches[1]);

which results in

Array
(
    [0] => <w:r><w:t>foo</w:t></w:r>
    [1] => <w:r><w:t>bar</w:t></w:r>
    [2] => <w:r><w:t>baz</w:t></w:r>
)

Answers

- Thefourthbird
- May 2, 2023 at 10:34 am
- 0 votes
Instead of a regex, you can use DOMDocument and DOMXPath: select all child elements of the body with the XPath expression /w:document/w:body/* and then check, for example, the nodeName of each node:
```
$dom = new DOMDocument();
$dom->loadXML($text);

$xpath = new DOMXPath($dom);
$elms = [];
foreach ($xpath->query('/w:document/w:body/*') as $node) {
    if ($node->nodeName === "w:p") {
        $elms[] = "a paragraph";
    }
    if ($node->nodeName === "w:tbl") {
        $elms[] = "a table";
    }    
}

print_r($elms);
```
Output
```
Array
(
    [0] => a paragraph
    [1] => a table
    [2] => a paragraph
)
```
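If only paragraphs and tables are of interest, the filtering can also be moved into the XPath expression itself with a self:: predicate. This is a minimal sketch, not part of the answer above: the shortened $text document is a stand-in for the question's XML, and the w prefix is registered explicitly here as a precaution.

```php
<?php
$ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Shortened stand-in for the question's document (assumption: same top-level layout).
$text = '<w:document xmlns:w="' . $ns . '"><w:body>'
      . '<w:p/><w:tbl><w:tr><w:tc><w:p/></w:tc></w:tr></w:tbl><w:p/><w:sectPr/>'
      . '</w:body></w:document>';

$dom = new DOMDocument();
$dom->loadXML($text);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('w', $ns);

$names = [];
// Only direct children of w:body that are w:p or w:tbl; the w:p nested
// inside the table cell and the w:sectPr element are not selected.
foreach ($xpath->query('/w:document/w:body/*[self::w:p or self::w:tbl]') as $node) {
    $names[] = $node->nodeName;
}

print_r($names); // w:p, w:tbl, w:p
```

With the predicate in place, the loop body no longer needs one if statement per element name.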

Don't use regex to parse XML; use an XML parser. XPath expressions let you fetch specific nodes.

Be aware that namespace prefixes (aliases) are only valid for an element and its descendants until they are redefined. They can differ between documents and can even change on a descendant node in the same document. The namespace URI is the unique identifier of a namespace; the prefix is just there for human readability and document size. In the example I will use a prefix different from the one in the document.

So you are looking for the child elements inside the {http://schemas.openxmlformats.org/wordprocessingml/2006/main}body element node (Clark notation).
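To illustrate why the namespace URI matters and the prefix does not, here is a rough sketch. The helper name bodyChildrenInClarkNotation, the constant WORD_MAIN_NS and the two tiny inline documents are made up for this example. Both documents bind the same namespace URI to different prefixes, and the same XPath expression with a self-chosen prefix matches both.

```php
<?php
const WORD_MAIN_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: list the children of the body element in Clark
// notation ({namespace-uri}local-name), whatever prefix the document uses.
function bodyChildrenInClarkNotation(string $xml): array
{
    $document = new DOMDocument();
    $document->loadXML($xml);

    $xpath = new DOMXPath($document);
    // 'main' is our own alias; only the namespace URI has to match.
    $xpath->registerNamespace('main', WORD_MAIN_NS);

    $names = [];
    foreach ($xpath->evaluate('//main:body/*') as $node) {
        $names[] = '{' . $node->namespaceURI . '}' . $node->localName;
    }
    return $names;
}

// Same structure, different prefixes ("w" and "word") for the same namespace URI.
$withW = '<w:document xmlns:w="' . WORD_MAIN_NS . '">'
       . '<w:body><w:p/><w:tbl/></w:body></w:document>';
$withWord = '<word:document xmlns:word="' . WORD_MAIN_NS . '">'
          . '<word:body><word:p/><word:tbl/></word:body></word:document>';

// Both calls yield the same two Clark-notation names.
var_dump(bodyChildrenInClarkNotation($withW) === bodyChildrenInClarkNotation($withWord)); // bool(true)
```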

Load the data into a DOM document, register a prefix for the namespace, and fetch the elements using an XPath expression.

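$document = new DOMDocument();
// getXMLString() is a placeholder for whatever returns the XML string (e.g. $text from the question)
$document->loadXML(getXMLString());
$xpath = new DOMXPath($document);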
// register the namespace using your own prefix
$xpath->registerNamespace('main', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

$result = [];
// fetch children of the body element
foreach ($xpath->evaluate('//main:body/*') as $mainElement) {
    // store namespace and node name
    $result[] = [
        'uri' => $mainElement->namespaceURI,
        'name' => $mainElement->localName
    ];
}
var_dump($result);

Output:

array(4) {
  [0]=>
  array(2) {
    ["uri"]=>
    string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    ["name"]=>
    string(1) "p"
  }
  [1]=>
  array(2) {
    ["uri"]=>
    string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    ["name"]=>
    string(3) "tbl"
  }
  [2]=>
  array(2) {
    ["uri"]=>
    string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    ["name"]=>
    string(1) "p"
  }
  [3]=>
  array(2) {
    ["uri"]=>
    string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    ["name"]=>
    string(6) "sectPr"
  }
}
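To get from this list to the labels asked for in the question, one option is a lookup array keyed by local name. The sketch below is an assumption about what the asker wants: the $labels map, the shortened sample document and the choice to skip unknown elements such as sectPr are illustrative only.

```php
<?php
$mainNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Assumption: labels mirror the wording of the question's expected result.
$labels = [
    'p'   => 'a paragraph',
    'tbl' => 'a table',
];

// Shortened stand-in for the question's document (same top-level structure).
$xml = <<<DOCX
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="$mainNs">
  <w:body>
    <w:p><w:r><w:t>foo</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>bar</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:t>baz</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>
DOCX;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('main', $mainNs);

$result = [];
foreach ($xpath->evaluate('/main:document/main:body/*') as $node) {
    // Keep only known elements from the main namespace; sectPr has no label and is skipped.
    if ($node->namespaceURI === $mainNs && isset($labels[$node->localName])) {
        $result[] = $labels[$node->localName];
    }
}

print_r($result); // a paragraph, a table, a paragraph
```

The paragraph nested inside the table cell is not a direct child of the body element, so it does not appear in the result.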
