
This question has been asked a lot, but the posted answers do not work for me unfortunately.

I am trying to parse custom XML for documentation that has its own DTD and such. My goal is to generate HTML documentation from the XML markup of the documentation. The XML is given and cannot be modified, for all practical purposes.

Generating the HTML is easy – getting the XML into a program so that I can work with it seems to be the challenging part here. I’ve tried many different techniques, and they all seem to fail in some case or another.

  • PHP’s SimpleXML parser on its own does not include child attributes (and a lot of other content) in what I get back, e.g. $xml = simplexml_load_string($xmlFile);
  • PHP’s SimpleXML parser with JSON encode/decode cannot handle child nodes that contain attributes, e.g. json_decode(json_encode($xml))
  • This solution I’ve found is the only one that can handle child nodes with attributes, but it doesn’t honor CDATA and basically butchers the entire file
  • Simply casting to an array seems reasonable, but it also fails to handle child nodes that contain attributes, e.g. $xml = simplexml_load_string($file); $array = (array)$xml;
  • DOMDocument gets totally confused and just generates a bunch of formatted plain text.
  • Other general issues include taking child nodes out of context inappropriately. Using CDATA mostly helps with this, but the solutions that handle this well don’t handle the other things well.
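For the attribute-related failures in the list above, it may help to separate the parser from the array conversion. The following is a minimal sketch, assuming a tiny inline sample rather than the real file: it compares the JSON round trip with reading the SimpleXMLElement directly. The variable names are made up for illustration.

```php
<?php
// A leaf node that has both an attribute and text content.
$xml = simplexml_load_string(
    '<variable name="RELOADSTATUS"><value name="SUCCESS">Reloaded.</value></variable>'
);

// The JSON round trip turns the text-only leaf into a plain string,
// so there is nowhere left to store its "name" attribute.
$viaJson = json_decode(json_encode($xml), true);

// Reading the SimpleXMLElement directly keeps both pieces.
$value    = $xml->value;
$attrName = (string) $value['name'];  // attribute, via array syntax
$text     = (string) $value;          // text content, via string cast
```

If this holds for the full file, the attributes are present in the parsed object and only disappear once it is flattened into an array.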

I was intending to parse the XML into an array, which is theoretically possible, but so far I have not been able to do this successfully.

The XML is approximately 32,000 lines. The requirement is that I need to capture everything: all attributes of all nodes, all content of all nodes, and CDATA literally. Surprisingly, every major parsing solution I have tried excludes something.

Short of writing a custom program specifically to parse this particular XML, is there a solution or way to reliably capture everything into an array (or some mechanism that would allow iterating through the whole thing)?
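One possible way to capture everything into an array is a recursive walk over a DOMDocument tree. This is a sketch only: the function name nodeToArray() and the array shape ('tag', 'attrs', 'children', 'text', 'cdata') are assumptions made for illustration, and it has not been run against the full 32,000-line file.

```php
<?php
/**
 * Convert a DOM node into a nested array that keeps the tag name, every
 * attribute, and every child (elements, text, CDATA) in document order.
 */
function nodeToArray(DOMNode $node): ?array
{
    // DOMCdataSection extends DOMText, so it has to be checked first.
    if ($node instanceof DOMCdataSection) {
        return ['cdata' => $node->data];          // CDATA kept literally
    }
    if ($node instanceof DOMText) {
        // Drop whitespace-only text that is just indentation.
        return trim($node->data) === '' ? null : ['text' => $node->data];
    }
    if (!($node instanceof DOMElement)) {
        return null;                              // comments, PIs, etc.
    }

    $attrs = [];
    foreach ($node->attributes as $attr) {
        $attrs[$attr->nodeName] = $attr->nodeValue;
    }

    $children = [];
    foreach ($node->childNodes as $child) {
        $converted = nodeToArray($child);
        if ($converted !== null) {
            $children[] = $converted;
        }
    }

    return ['tag' => $node->nodeName, 'attrs' => $attrs, 'children' => $children];
}

// Usage on a small sample with an attribute, text, and a CDATA section.
$doc = new DOMDocument();
$doc->loadXML('<value name="SUCCESS">ok <![CDATA[<literal>x</literal>]]></value>');
$tree = nodeToArray($doc->documentElement);
```

Keeping children as an ordered list, rather than keying them by tag name, is what preserves the position of inline elements relative to the surrounding text.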

Here is the full XML file for reference: https://interlinked.us/files/xml.txt

I’ll point out a few things:

  • I’m preprocessing the file by adding CDATA around certain tags:
$xmlFile = str_replace("<literal>", "<![CDATA[<literal>", $xmlFile);
$xmlFile = str_replace("</literal>", "</literal>]]>", $xmlFile);
$xmlFile = str_replace("<replaceable>", "<![CDATA[<replaceable>", $xmlFile);
$xmlFile = str_replace("</replaceable>", "</replaceable>]]>", $xmlFile);

This is because the end goal is simply to replace these with <span> or <b> or <code> or something like that, and I don’t want these particular nodes parsed as XML. Easy enough. That also requires that CDATA be honored, however.
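As a sketch of how the CDATA wrapping could be honored, the snippet below applies the same str_replace() preprocessing to a one-line sample and reads it back with DOMDocument, which exposes CDATA as its own node type when LIBXML_NOCDATA is not passed. The sample string and the choice of <code> as the replacement tag are assumptions.

```php
<?php
$xmlFile = '<para>Use <literal>chan_iax2</literal> to reload IAX2.</para>';

// Same preprocessing as above, applied to the sample.
$xmlFile = str_replace('<literal>', '<![CDATA[<literal>', $xmlFile);
$xmlFile = str_replace('</literal>', '</literal>]]>', $xmlFile);

$doc = new DOMDocument();
$doc->loadXML($xmlFile);  // no LIBXML_NOCDATA, so CDATA sections stay distinct

$html = '';
foreach ($doc->documentElement->childNodes as $child) {
    if ($child instanceof DOMCdataSection) {
        // Escape the literal tag text first, then swap the escaped
        // <literal> markers for an HTML tag.
        $escaped = htmlspecialchars($child->data);
        $html .= str_replace(
            ['&lt;literal&gt;', '&lt;/literal&gt;'],
            ['<code>', '</code>'],
            $escaped
        );
    } elseif ($child instanceof DOMText) {
        $html .= htmlspecialchars($child->data);
    }
}
```

For this sample, $html ends up as "Use <code>chan_iax2</code> to reload IAX2.", with the surrounding text left in place.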

  • Here is an example of XML that usually fails to parse properly in most solutions:
<application name="Reload" language="en_US">
        <synopsis>
            Reloads an Asterisk module, blocking the channel until the reload has completed.
        </synopsis>
        <syntax>
            <parameter name="module" required="false">
                <para>The full name(s) of the target module(s) or resource(s) to reload.
                If omitted, everything will be reloaded.</para>
                <para>The full names MUST be specified (e.g. <literal>chan_iax2</literal>
                to reload IAX2 or <literal>pbx_config</literal> to reload the dialplan.</para>
            </parameter>
        </syntax>
        <description>
            <para>Reloads the specified (or all) Asterisk modules and reports success or failure.
            Success is determined by each individual module, and if all reloads are successful,
            that is considered an aggregate success. If multiple modules are specified and any
            module fails, then FAILURE will be returned. It is still possible that other modules
            did successfully reload, however.</para>
            <para>Sets <variable>RELOADSTATUS</variable> to one of the following values:</para>
            <variablelist>
                <variable name="RELOADSTATUS">
                    <value name="SUCCESS">
                        Specified module(s) reloaded successfully.
                    </value>
                    <value name="FAILURE">
                        Some or all of the specified modules failed to reload.
                    </value>
                </variable>
            </variablelist>
        </description>
    </application>

The parsing failure is that SUCCESS and FAILURE are nowhere to be found in the parsed array! This seems to be because most XML parsers ignore attributes in leaf nodes.
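To check whether the names are really absent from the parser's view, or only from the converted array, one option is to query the SimpleXMLElement with XPath. This is a minimal sketch on a trimmed copy of the example above, assuming PHP 7.3 or later for the nowdoc syntax; the $values array is a made-up name.

```php
<?php
$xml = simplexml_load_string(<<<'XML'
<application name="Reload" language="en_US">
    <description>
        <variablelist>
            <variable name="RELOADSTATUS">
                <value name="SUCCESS">
                    Specified module(s) reloaded successfully.
                </value>
                <value name="FAILURE">
                    Some or all of the specified modules failed to reload.
                </value>
            </variable>
        </variablelist>
    </description>
</application>
XML);

// Map each value's name attribute to its trimmed text content.
$values = [];
foreach ($xml->xpath('//variablelist/variable/value') as $value) {
    $values[(string) $value['name']] = trim((string) $value);
}
```

Here both the name attribute and the text of each leaf are available from the same element, so nothing needs to be flattened first.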

  • Another likely requirement is that leaf nodes which contain only text, and whose parent also contains other text, should not be parsed as separate elements. As an example, in the XML above, notice that the variable tag is used in multiple ways: as an inline formatter similar to literal and replaceable, but also as a node type of its own, as in variablelist.

  • The solution needs to be contained within a single script (but I would be okay with installing Debian packages). I’m most familiar with how to do this kind of thing in PHP, but open to other tools, especially if they are POSIX portable.

Ultimately, I’m not looking for the most elegant solution or output, but something that will at least work and fully capture everything. I seem to have exhausted the built-in PHP tools and common answers – any suggestions on how to approach this?

Again, the goal is to generate the HTML for a webpage from this. Hence, I need all of the attributes and values so that I can construct the webpage, properly in context.
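To keep inline tags in context while generating HTML, one approach is to skip the intermediate array and render mixed content straight from the DOM. The sketch below is an assumption-laden illustration: renderInline() is a hypothetical helper, and the mapping of literal, replaceable, and variable to HTML tags is arbitrary.

```php
<?php
/**
 * Render the mixed content of a node as HTML, walking text and inline
 * elements in document order so nothing is taken out of context.
 */
function renderInline(DOMNode $node): string
{
    // Hypothetical mapping from documentation tags to HTML tags.
    $inlineTags = ['literal' => 'code', 'replaceable' => 'i', 'variable' => 'b'];

    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {          // also matches CDATA sections
            $html .= htmlspecialchars($child->data);
        } elseif ($child instanceof DOMElement) {
            $tag = $inlineTags[$child->nodeName] ?? 'span';
            $html .= "<$tag>" . renderInline($child) . "</$tag>";
        }
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<para>Sets <variable>RELOADSTATUS</variable> to one of the following values:</para>'
);
$paraHtml = '<p>' . renderInline($doc->documentElement) . '</p>';
```

A variable element that sits inside variablelist and carries a name attribute would need its own branch (for example, keyed on the parent's tag name), which this sketch leaves out.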

The best I have found so far is xmlObjToArr() from the comments on the PHP manual page, which actually runs. I checked, and it does at least pass the leaf-node attribute test, so I’m going to see whether anything else is missing from its output. All the other solutions execute instantly, whereas this one takes 45-60 seconds on an idle server, but if that’s what it takes to parse this XML, I guess it is what it is.

2 Answers


  1. The easiest way to parse xml into an array that works perfectly for my needs is:

    $array = json_decode(json_encode(simplexml_load_string($xml)), true);
    
  2. I’ve worked on this question for a few days and found out that the best solution was to create a class implementing IteratorAggregate to iterate on a SimpleXMLIterator.

    /**
     * Class XMLDeserializerIterator
     */
    class XMLDeserializerIterator implements IteratorAggregate
    {
        /**
         * @var SimpleXMLIterator
         */
        private $iterator;
    
    
        /**
         * @param SimpleXMLIterator $iterator
         *
         * @return void
         */
        public function __construct(SimpleXMLIterator $iterator)
        {
            $this->iterator = $iterator;
        }
    
        /**
         * @return Generator
         *
         * @throws Throwable
         */
        public function getIterator(): Generator
        {
            foreach ($this->iterator->nodeName->childNodeName as $xmlElement) {
    
                yield doWhatYouWantWith($xmlElement);
            }
        }
    }

    The SimpleXMLIterator needed to construct this object can be created like this:

    $xmlIterator = new SimpleXMLIterator('yourXMLString');
    

    or

    $xmlIterator = new SimpleXMLIterator('pathToYourXMLFile', 0, true);
    

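    You can access attributes with array syntax, $xmlElement['attributeName'], where attributeName is its real name in your XML file. (The arrow syntax, $xmlElement->childName, accesses child elements rather than attributes.)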
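    A small runnable sketch of the iterator approach, assuming an inline sample string instead of the real file. It reads attributes with SimpleXML's array syntax; the $names array is a made-up name for illustration.

```php
<?php
$xmlIterator = new SimpleXMLIterator(
    '<variable name="RELOADSTATUS">'
    . '<value name="SUCCESS">ok</value>'
    . '<value name="FAILURE">failed</value>'
    . '</variable>'
);

$names = [];
foreach ($xmlIterator as $tag => $xmlElement) {
    // The key is the child's tag name; attributes use array syntax,
    // whereas ->name would look for a child element called "name".
    $names[] = $tag . ':' . (string) $xmlElement['name'];
}
```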
