I have a PHP variable holding a string of XML. I need to collect the top-level elements of the document body, e.g. <w:p> and <w:tbl> (paragraphs and tables, in their existing order), into an array like the one below, without their contents.
Sample array with the expected result:
- ‘a paragraph’
- ‘a table’
- ‘a paragraph’
This is the PHP code I have so far:
<?php
$text= <<<EOT
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w:rsidR="00FC1847" w:rsidRDefault="00A526BC">
<w:r>
<w:t>foo</w:t>
</w:r>
<w:r w:rsidR="007C7582">
<w:t>0</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:tbl>
<w:tblPr>
<w:tblStyle w:val="TabloKlavuzu"/>
<w:tblW w:w="0" w:type="auto"/>
<w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
</w:tblPr>
<w:tblGrid>
<w:gridCol w:w="11329"/>
</w:tblGrid>
<w:tr w:rsidR="00A526BC" w:rsidTr="00A526BC">
<w:tc>
<w:tcPr>
<w:tcW w:w="11329" w:type="dxa"/>
</w:tcPr>
<w:p w:rsidR="00A526BC" w:rsidRDefault="00A526BC">
<w:r>
<w:t>bar</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>
<w:p w:rsidR="00A526BC" w:rsidRDefault="00A526BC">
<w:r>
<w:t>baz</w:t>
</w:r>
</w:p>
<w:sectPr w:rsidR="00A526BC" w:rsidSect="00A526BC">
<w:pgSz w:w="11907" w:h="16839" w:code="9"/>
<w:pgMar w:top="459" w:right="284" w:bottom="1418" w:left="284" w:header="709" w:footer="709" w:gutter="0"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
EOT;
preg_match_all('%<w:p .*?>(.*?<w:r>.*?</w:r>).*?</w:p>%si', $text, $matches);
print_r($matches[1]);
which results in
Array
(
[0] => <w:r><w:t>foo</w:t></w:r>
[1] => <w:r><w:t>bar</w:t></w:r>
[2] => <w:r><w:t>baz</w:t></w:r>
)
2 Answers
Instead of using a regex, you can use DOMDocument and DOMXPath to get all the child elements of the body with the XPath expression
/w:document/w:body/*
and then check, for example, the nodeName of each element.
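A minimal sketch of that approach, not the original demo. Assumptions: the XML is a trimmed stand-in for the `$text` from the question (only the `w` namespace declared), and the `$result` array and the labels are illustrative names. The `w` prefix is registered explicitly so the expression does not depend on how the prefix is picked up from the document.

```php
<?php
// Trimmed stand-in for the question's $text (assumption: only the w: namespace matters here).
$text = <<<'EOT'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>foo</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>bar</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:t>baz</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>
EOT;

$doc = new DOMDocument();
$doc->loadXML($text);

$xpath = new DOMXPath($doc);
// Bind the prefix used in the expression to the WordprocessingML namespace URI.
$xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

$result = [];
// "*" selects only the direct child elements of w:body, in document order.
foreach ($xpath->query('/w:document/w:body/*') as $node) {
    if ($node->nodeName === 'w:p') {
        $result[] = 'a paragraph';
    } elseif ($node->nodeName === 'w:tbl') {
        $result[] = 'a table';
    }
    // Anything else (e.g. w:sectPr) is skipped.
}

print_r($result);
```

Because only direct children of `w:body` are selected, the paragraph nested inside the table cell ("bar") is not reported separately, so the array holds a paragraph, a table and a paragraph.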
Don’t use a regex to parse XML; use an XML parser. XPath expressions allow you to fetch specific nodes.
Be aware that namespace prefixes/aliases are only valid for an element and its descendants until redefined. They can change between documents and even on a descendant node in the same document. The namespace URI is the unique identifier of a namespace. The prefix/alias is just for human readability and document size. I will use a prefix different from the one in the document in the example.
So you are looking for the child elements inside the
{http://schemas.openxmlformats.org/wordprocessingml/2006/main}body
element node (Clark notation).
Load the data into a DOM document, register a prefix for the namespace and fetch the elements using an XPath expression.
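The steps above could look roughly like this. It is a sketch under assumptions: the XML is a shortened version of the question's document, and the `word` prefix, the `$labels` map and the `$items` array are names chosen for illustration. The prefix deliberately differs from the `w` used in the document; what has to match is the namespace URI.

```php
<?php
// Shortened version of the question's document (assumption: extra namespaces omitted).
$text = <<<'EOT'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>foo</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>bar</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:t>baz</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>
EOT;

$document = new DOMDocument();
$document->loadXML($text);

$xpath = new DOMXPath($document);
// Our own prefix for the namespace; the document's "w" prefix is irrelevant here.
$xpath->registerNamespace(
    'word',
    'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
);

// Map local names (without prefix) to the labels wanted in the result.
$labels = ['p' => 'a paragraph', 'tbl' => 'a table'];

$items = [];
// Fetch only paragraph and table elements that are direct children of the body.
$expression = '/word:document/word:body/*[self::word:p or self::word:tbl]';
foreach ($xpath->evaluate($expression) as $node) {
    $items[] = $labels[$node->localName];
}

var_export($items);
```

Comparing `localName` instead of `nodeName` keeps the check independent of whatever prefix a particular document happens to use.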