A simple Office document containing two paragraphs and a table between them

I have a PHP variable holding a string of XML. I need to collect the top-level elements of its body, e.g. <w:p> and <w:tbl> (paragraphs and tables, in their existing order), into an array like the one below, without their contents.

A sample array with the expected result:

  • ‘a paragraph’
  • ‘a table’
  • ‘a paragraph’

The PHP code I have so far:

<?php
$text = <<<EOT
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
  <w:body>
    <w:p w:rsidR="00FC1847" w:rsidRDefault="00A526BC">
      <w:r>
        <w:t>foo</w:t>
      </w:r>
      <w:r w:rsidR="007C7582">
        <w:t>0</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="TabloKlavuzu"/>
        <w:tblW w:w="0" w:type="auto"/>
        <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
      </w:tblPr>
      <w:tblGrid>
        <w:gridCol w:w="11329"/>
      </w:tblGrid>
      <w:tr w:rsidR="00A526BC" w:rsidTr="00A526BC">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="11329" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="00A526BC" w:rsidRDefault="00A526BC">
            <w:r>
              <w:t>bar</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
    </w:tbl>
    <w:p w:rsidR="00A526BC" w:rsidRDefault="00A526BC">
      <w:r>
        <w:t>baz</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="00A526BC" w:rsidSect="00A526BC">
      <w:pgSz w:w="11907" w:h="16839" w:code="9"/>
      <w:pgMar w:top="459" w:right="284" w:bottom="1418" w:left="284" w:header="709" w:footer="709" w:gutter="0"/>
      <w:cols w:space="708"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
EOT;
preg_match_all('%<w:p .*?>(.*?<w:r>.*?</w:r>).*?</w:p>%si', $text, $matches);
print_r($matches[1]);

which results in

Array
(
    [0] => <w:r><w:t>foo</w:t></w:r>
    [1] => <w:r><w:t>bar</w:t></w:r>
    [2] => <w:r><w:t>baz</w:t></w:r>
)

2 Answers


  1. Instead of using a regex, you can use DOMDocument and DOMXPath to get all the child elements of the body with the XPath expression /w:document/w:body/* and then check each node's nodeName:

    $dom = new DOMDocument();
    $dom->loadXML($text);
    
    $xpath = new DOMXPath($dom);
    $elms = [];
    foreach ($xpath->query('/w:document/w:body/*') as $node) {
        if ($node->nodeName === "w:p") {
            $elms[] = "a paragraph";
        }
        if ($node->nodeName === "w:tbl") {
            $elms[] = "a table";
        }    
    }
    
    print_r($elms);
    

    Output

    Array
    (
        [0] => a paragraph
        [1] => a table
        [2] => a paragraph
    )
    

  2. Don’t use regex to parse XML; use an XML parser. XPath expressions allow you to fetch specific nodes.

    Be aware that namespace prefixes/aliases are only valid for an element and its descendants until redefined. They can change between documents and even on a descendant node in the same document. The namespace URI is the unique identifier of a namespace; the prefix/alias only serves human readability and document size. To illustrate this, the example uses a prefix different from the one in the document.
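    To make the point about prefixes concrete, here is a minimal sketch. The two XML fragments are hypothetical and heavily shortened; they share the WordprocessingML namespace URI but declare it with different prefixes (w and x). The prefix main is chosen freely on the XPath side.

    ```php
    <?php
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    // Hypothetical fragments: same namespace URI, different prefixes.
    $samples = [
        '<w:document xmlns:w="' . $ns . '"><w:body><w:p/></w:body></w:document>',
        '<x:document xmlns:x="' . $ns . '"><x:body><x:p/></x:body></x:document>',
    ];

    $found = [];
    foreach ($samples as $xml) {
        $document = new DOMDocument();
        $document->loadXML($xml);
        $xpath = new DOMXPath($document);
        // 'main' is our own prefix; only the URI has to match the document.
        $xpath->registerNamespace('main', $ns);
        foreach ($xpath->evaluate('/main:document/main:body/*') as $node) {
            // nodeName carries the document's prefix, localName does not.
            $found[] = $node->nodeName . ' => ' . $node->localName;
        }
    }
    print_r($found);
    ```

    The same expression matches both fragments, while nodeName differs (w:p and x:p). This is why comparing localName together with namespaceURI tends to be more robust than comparing nodeName.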

    So you are looking for the child elements inside the {http://schemas.openxmlformats.org/wordprocessingml/2006/main}body element node (Clark notation).

    Load the data into a DOM document, register a prefix for the namespace and fetch the elements using an XPath expression.

    $document = new DOMDocument();
    // getXMLString() stands in for whatever returns the XML string (e.g. $text from the question)
    $document->loadXML(getXMLString());
    $xpath = new DOMXPath($document);
    // register the namespace using your own prefix
    $xpath->registerNamespace('main', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
    
    $result = [];
    // fetch children of the body element
    foreach ($xpath->evaluate('//main:body/*') as $mainElement) {
        // store namespace and node name
        $result[] = [
            'uri' => $mainElement->namespaceURI,
            'name' => $mainElement->localName
        ];
    }
    var_dump($result);
    

    Output:

    array(4) {
      [0]=>
      array(2) {
        ["uri"]=>
        string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        ["name"]=>
        string(1) "p"
      }
      [1]=>
      array(2) {
        ["uri"]=>
        string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        ["name"]=>
        string(3) "tbl"
      }
      [2]=>
      array(2) {
        ["uri"]=>
        string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        ["name"]=>
        string(1) "p"
      }
      [3]=>
      array(2) {
        ["uri"]=>
        string(60) "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        ["name"]=>
        string(6) "sectPr"
      }
    }
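    The output above lists every child of the body, including sectPr. As a sketch of how this could be reduced to the labels asked for in the question, the following maps local names to labels and skips anything else. The $labels map and the shortened XML string are assumptions made for illustration, not part of the original document.

    ```php
    <?php
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    // Shortened stand-in for the document in the question (an assumption).
    $xml = '<w:document xmlns:w="' . $ns . '"><w:body>'
        . '<w:p><w:r><w:t>foo</w:t></w:r></w:p>'
        . '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>bar</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
        . '<w:p><w:r><w:t>baz</w:t></w:r></w:p>'
        . '<w:sectPr/>'
        . '</w:body></w:document>';

    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    $xpath->registerNamespace('main', $ns);

    // Hypothetical label map: local element name => label.
    $labels = ['p' => 'a paragraph', 'tbl' => 'a table'];

    $result = [];
    // Only direct children of the body are visited, so the paragraph
    // nested inside the table cell is not listed separately.
    foreach ($xpath->evaluate('/main:document/main:body/*') as $element) {
        if ($element->namespaceURI === $ns && isset($labels[$element->localName])) {
            $result[] = $labels[$element->localName];
        }
    }
    print_r($result);
    ```

    Elements without an entry in the map, such as sectPr, are simply ignored; adding further element types would only require extending $labels.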
    