
I have a very large XML file in the following format (this is a very small snippet showing two of the sections).

<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>

I would like to quickly find the instances of all the parent elements (i.e. Game and Platform in the above example) to count them but also to extract the contents.

To complicate matters, there is also a Platform "child" inside Game, which I don't want to count. I only want the parent (i.e. I do not want Game -> Platform, but I do want the top-level Platform).

From a combination of this site and Google I came up with the following function code:

$attributeCount = 0;

$xml = new XMLReader();
$xml->open($xmlFile);
$elements = new XMLElementIterator($xml, $sectionNameWereGetting);
// $sectionNameWereGetting is a variable that changes to Game and Platform etc

foreach( $elements as $key => $indElement ){
            if ($xml->nodeType == XMLReader::ELEMENT && $xml->name == $sectionNameWereGetting) {
                $parseElement = new SimpleXMLElement($xml->readOuterXML());
// NOW I CAN COUNT IF THE ELEMENT HAS CHILDREN
                $thisCount = $parseElement->count();
                unset($parseElement);
                if ($thisCount == 0){
// IF THERE'S NO CHILDREN THEN SKIP THIS ELEMENT
                    continue;
                }
// IF THERE IS CHILDREN THEN INCREMENT THE COUNT
// - IN ANOTHER FUNCTION I GRAB THE CONTENTS HERE
// - AND PUT THEM IN THE DATABASE
                $attributeCount++;
            }
}
unset($elements);
$xml->close();
unset($xml);

return  $attributeCount;

I’m using the excellent script by Hakre at https://github.com/hakre/XMLReaderIterator/blob/master/src/XMLElementIterator.php

This does work. But I think assigning a new SimpleXMLElement is slowing the operation down.

I only need the SimpleXMLElement to check whether the element has children. I use that to work out whether the element is nested inside another parent: a parent element will have children, so I count it, whereas an element nested inside another parent won't have children, so I ignore it.

But perhaps there is a better solution than counting children? i.e. a $xml->isParent() function or something?

The current function times out before it has counted all the sections of the XML (there are around 8 different sections, and some of them have several hundred thousand records).

How can I make this process more efficient? I'm also using similar code to grab the contents of the main sections and put them into a database, so it will pay dividends to be as efficient as possible.

Also worth noting that I’m not particularly good at programming so please feel free to point out other mistakes I may have made so that I can improve.

4 Answers


  1. Chosen as BEST ANSWER

    **Solution** Building on the shoulders of giants (thanks to all who replied, especially @ThW), I used the DOMDocument solution. With some time logging I found that searching the document to get to the correct starting point was taking a lot of the time. So I looped around the 'while' to keep the pointer in the correct position. This has cut the transfer time from 4.5 hours to a few minutes. When I 'break' from the while loop, I return to an Ajax query that updates the screen and re-runs until the whole XML has been imported.

            $reader = new XMLReader();
            $reader->open($xmlFile);
    
            $document = new DOMDocument();
            $xpath = new DOMXpath($document);
    
            $found = false;
            // look for the document element
            do {
              $found = $found ? $reader->next() : $reader->read();
            } while (
              $found && 
              $reader->localName !== 'LaunchBox'
            );
    
            // go to first child of the document element
            if ($found) {
                $found = $reader->read();
            }
    
            $counts = [];
    
            while ($found && $reader->depth === 1) {
    
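                $currentElementKey++;
    
                if( $currentElementKey <= $positionInDocument ){
                    // WE DON'T WANT THIS RECORD AS WE'VE ALREADY ADDED IT
                    // SKIP IT AND RE-CHECK THE LOOP CONDITION, OTHERWISE THE FOLLOWING NODE WOULD BE PROCESSED BELOW
                    $found = $reader->next();
                    continue;
                }    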
    
                if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName == $sectionNameWereGetting) {
    
    
                    // expand into DOM 
                    $node = $reader->expand($document);
                    // import DOM into SimpleXML 
                    $simpleXMLObject = simplexml_import_dom($node);
    
                    // TRANSFER OBJECT INTO ARRAY READY FOR DATABASE
                    // RESET FIRST SO FIELDS FROM THE PREVIOUS RECORD DON'T CARRY OVER
                    $addRecord = array();
                    foreach($simpleXMLObject as $elIndex => $elContent){
                        $addRecord[$elIndex] = trim((string) $elContent);
                    }
    
                    // MAKE ARRAY OF ARRAYS FOR DATABASE
                    $allRecordsToAdd[] = $addRecord;
                    // INCREMENT THE COUNT OF RECORDS WE'VE TRANSFERRED
                    $currentRecordNumberTransferring++;
                    // clearing current element
                    unset($simpleXMLObject);
    
    
                }
                $positionInDocument = $currentElementKey;
                // KEEP THE RESULT SO THE LOOP ALSO STOPS WHEN THERE ARE NO MORE NODES
                $found = $reader->next();
                if( $currentRecordNumberTransferring >= $nextStoppingPoint ){
                    // WE NEED TO STOP AND REPORT BACK
    
                    DB::disableQueryLog();              
                    DB::table($dbTableName)->insert($allRecordsToAdd);
                    $allRecordsToAdd = array();
    
                    $loopTheWhileForSpeed++;
                    if( $loopTheWhileForSpeed < $maxLoops ){
                        $nextStoppingPoint = self::calculateNextAjaxStoppingPoint($currentRecordNumberTransferring, $totalNumberOfRecords, $maxRecordsAtATime);           
                    } else {
                        break;
                    }
    
                    
                }
    
    
            }
    
        $documentStats["positionInDocument"] = $positionInDocument;
        $documentStats["currentRecordNumberTransferring"] = $currentRecordNumberTransferring;
    
    
        $reader->close();
        unset($reader);
        unset($document);
        unset($xpath);
    
        return  $documentStats;
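    
    The batching step in the loop above can also be isolated into a small helper. The following is only a sketch under assumptions: `insertInBatches` is a hypothetical name, and the `$insert` callback stands in for a call such as `DB::table($dbTableName)->insert($batch)`. It also flushes a partial final batch, which may or may not be needed depending on how the stopping point is calculated.

    ```php
    /**
     * Hypothetical helper: collect records into batches and pass each batch to $insert.
     * $insert is an assumption standing in for the real database insert call.
     */
    function insertInBatches(iterable $records, int $batchSize, callable $insert): int
    {
        $batch = [];
        $total = 0;

        foreach ($records as $record) {
            $batch[] = $record;
            $total++;

            // A full batch is handed over and the buffer is emptied.
            if (count($batch) >= $batchSize) {
                $insert($batch);
                $batch = [];
            }
        }

        // Flush whatever is left so a partial final batch is not lost.
        if ($batch !== []) {
            $insert($batch);
        }

        return $total;
    }

    // Usage with placeholder records and a callback that only records batch sizes.
    $records = array_map(fn ($i) => ['Name' => "Record $i"], range(1, 5));
    $batchSizes = [];

    $total = insertInBatches($records, 2, function (array $batch) use (&$batchSizes) {
        $batchSizes[] = count($batch);
    });

    echo $total, ' records in batches of ', implode(', ', $batchSizes), PHP_EOL;
    // 5 records in batches of 2, 2, 1
    ```

    Because the helper accepts any iterable, it could be fed from a generator that streams records out of the XMLReader loop instead of an in-memory array.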
    

  2. It sounds like using an XPath expression instead of iterating over the XML might work for your use case. With XPath you can select exactly the nodes you need:

    $xml = simplexml_load_string($xmlStr);
    
    $games = $xml->xpath('/LaunchBox/Game');
    
    echo count($games).' games'.PHP_EOL;
    
    foreach ($games as $game) {
        print_r($game);
    }
    

    https://3v4l.org/bLLEi#v8.2.3
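
    To illustrate how the path handles the nested Platform elements from the question, here is a minimal sketch. The sample data is a shortened version of the question's XML, and the variable names are only illustrative. Note that simplexml_load_string() loads the whole document into memory, so this may not suit a very large file.

    ```php
    // Shortened sample data mirroring the structure in the question.
    $xmlStr = '<LaunchBox>
      <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
      <Game><Name>Wishbringer</Name><Platform>ZiNc</Platform></Game>
      <Platform><Name>3DO Interactive Multiplayer</Name></Platform>
      <Platform><Name>Commodore Amiga</Name></Platform>
    </LaunchBox>';

    $xml = simplexml_load_string($xmlStr);

    // Absolute path: only <Platform> elements that are direct children of <LaunchBox>.
    $topLevelPlatforms = $xml->xpath('/LaunchBox/Platform');

    // Descendant axis: every <Platform> anywhere, including those nested in <Game>.
    $allPlatforms = $xml->xpath('//Platform');

    echo count($topLevelPlatforms), ' top-level platforms', PHP_EOL; // 2 top-level platforms
    echo count($allPlatforms), ' platforms in total', PHP_EOL;       // 4 platforms in total

    foreach ($topLevelPlatforms as $platform) {
        echo (string) $platform->Name, PHP_EOL;
    }
    ```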

  3. You do not need to serialize the XML to load it into DOM or SimpleXML. You can expand into a DOM document:

    $reader = new XMLReader();
    $reader->open(getXMLDataURL());
    
    $document = new DOMDocument();
    
    // navigate using read()/next()
    
    while ($found) {
      // expand into DOM 
      $node = $reader->expand($document);
      // import DOM into SimpleXML 
      $simpleXMLObject = simplexml_import_dom($node);
     
      // navigate using read()/next()
    }
    

    However, counting the element children of the document element can be done with just the right calls to XMLReader::read() and XMLReader::next(). read() moves to the following node, including descendants, while next() moves to the following sibling node, ignoring the descendants.

    $reader = new XMLReader();
    $reader->open(getXMLDataURL());
    
    $document = new DOMDocument();
    $xpath = new DOMXpath($document);
    
    $found = false;
    // look for the document element
    do {
      $found = $found ? $reader->next() : $reader->read();
    } while (
      $found && 
      $reader->localName !== 'LaunchBox'
    );
    
    // go to first child of the document element
    if ($found) {
        $found = $reader->read();
    }
    
    $counts = [];
    
    // found a node at depth 1 
    while ($found && $reader->depth === 1) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if (isset($counts[$reader->localName])) {
                $counts[$reader->localName]++;
            } else {
                $counts[$reader->localName] = 1;
            }
        }
        // go to next sibling node
        $found = $reader->next();
    }
    
    var_dump($counts);
    
    
    function getXMLDataURL() {
       $xml = <<<'XML'
    <?xml version="1.0" standalone="yes"?>
    <LaunchBox>
      <Game>
        <Name>Violet</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Game>
        <Name>Wishbringer</Name>
        <ReleaseYear>1985</ReleaseYear>
        <MaxPlayers>1</MaxPlayers>
        <Platform>ZiNc</Platform>
      </Game>
      <Platform>
        <Name>3DO Interactive Multiplayer</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
        <Developer>The 3DO Company</Developer>
      </Platform>
      <Platform>
        <Name>Commodore Amiga</Name>
        <Emulated>true</Emulated>
        <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
        <Developer>Commodore International</Developer>
      </Platform>
    </LaunchBox>
    XML;
        return 'data:application/xml;base64,'.base64_encode($xml);
    }
    

    Output:

    array(2) {
      ["Game"]=>
      int(2)
      ["Platform"]=>
      int(2)
    }
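    
    The same depth-1 loop can be combined with expand() to extract the content of each record as well as count it. The sketch below is one possible arrangement, not a definitive implementation: the function name `topLevelRecords` and the flat field => text mapping are assumptions, and the sample data is a shortened version of the question's XML.

    ```php
    /**
     * Sketch: yield each depth-1 element named $name as an associative array.
     * The function name and the flat field => text mapping are assumptions.
     */
    function topLevelRecords(string $uri, string $name): Generator
    {
        $reader = new XMLReader();
        if (!$reader->open($uri)) {
            throw new RuntimeException("Cannot open $uri");
        }
        $document = new DOMDocument();

        // Move to the document element, then step into its first child.
        $found = $reader->read();
        while ($found && $reader->nodeType !== XMLReader::ELEMENT) {
            $found = $reader->read();
        }
        $found = $found && $reader->read();

        while ($found && $reader->depth === 1) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $name) {
                // expand() builds a DOM node for the current element only.
                $element = simplexml_import_dom($reader->expand($document));
                $record = [];
                foreach ($element as $field => $value) {
                    $record[$field] = trim((string) $value);
                }
                yield $record;
            }
            // next() jumps to the following sibling, so nested elements are never visited.
            $found = $reader->next();
        }

        $reader->close();
    }

    $xml = '<LaunchBox>
      <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
      <Platform><Name>Commodore Amiga</Name><Emulated>true</Emulated></Platform>
    </LaunchBox>';
    $uri = 'data:application/xml;base64,' . base64_encode($xml);

    foreach (topLevelRecords($uri, 'Platform') as $record) {
        print_r($record);
    }
    ```

    Only the top-level Platform is yielded here; the Platform inside Game sits at depth 2 and is skipped by next().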
    
  4. I’m not sure I’ve fully understood your requirement but if the output you are looking for is:

    { "Game":2, "Platform":2 }
    

    then you can achieve it with this streamable XSLT 3.0 stylesheet:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:map="http://www.w3.org/2005/xpath-functions/map"
       version="3.0">
      
       <xsl:mode streamable="yes"/>
       <xsl:output method="json" indent="yes"/>
       <xsl:template match="/">
          <xsl:sequence select="fold-left(/*/*/local-name(), map{}, 
             function($map, $name){
               map:put($map, $name, 
                 if (map:contains($map, $name)) 
                 then map:get($map, $name) + 1 
                 else 1)})"/>
       </xsl:template>
       
    </xsl:stylesheet>
    

    XSLT 3.0 is available via a PHP API in the SaxonC product (caveat, this is my company’s product).
