I have a very large XML file with the following format (this is a very small snippet of two of the sections):
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game>
    <Name>Violet</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Game>
    <Name>Wishbringer</Name>
    <ReleaseYear>1985</ReleaseYear>
    <MaxPlayers>1</MaxPlayers>
    <Platform>ZiNc</Platform>
  </Game>
  <Platform>
    <Name>3DO Interactive Multiplayer</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1993-10-04T00:00:00-07:00</ReleaseDate>
    <Developer>The 3DO Company</Developer>
  </Platform>
  <Platform>
    <Name>Commodore Amiga</Name>
    <Emulated>true</Emulated>
    <ReleaseDate>1985-07-23T00:00:00-07:00</ReleaseDate>
    <Developer>Commodore International</Developer>
  </Platform>
</LaunchBox>
I would like to quickly find the instances of all the parent elements (i.e. Game and Platform in the above example) to count them but also to extract the contents.
To complicate matters, there is also a Platform "child" inside Game (which I don’t want to count). I only want the parent (i.e. I do not want Game -> Platform but I do want just Platform).
From a combination of this site and Google I came up with the following function code:
$attributeCount = 0;
$xml = new XMLReader();
$xml->open($xmlFile);
$elements = new XMLElementIterator($xml, $sectionNameWereGetting);
// $sectionNameWereGetting is a variable that changes to Game and Platform etc
foreach( $elements as $key => $indElement ){
    if ($xml->nodeType == XMLReader::ELEMENT && $xml->name == $sectionNameWereGetting) {
        $parseElement = new SimpleXMLElement($xml->readOuterXML());
        // NOW I CAN COUNT IF THE ELEMENT HAS CHILDREN
        $thisCount = $parseElement->count();
        unset($parseElement);
        if ($thisCount == 0){
            // IF THERE'S NO CHILDREN THEN SKIP THIS ELEMENT
            continue;
        }
        // IF THERE IS CHILDREN THEN INCREMENT THE COUNT
        // - IN ANOTHER FUNCTION I GRAB THE CONTENTS HERE
        // - AND PUT THEM IN THE DATABASE
        $attributeCount++;
    }
}
unset($elements);
$xml->close();
unset($xml);
return $attributeCount;
I’m using the excellent script by Hakre at https://github.com/hakre/XMLReaderIterator/blob/master/src/XMLElementIterator.php
This does work. But I think assigning a new SimpleXMLElement is slowing the operation down.
I only need the SimpleXMLElement to check if the element has children (which I’m using to ascertain if the element is inside another parent or not – i.e. if it’s a parent it ‘will’ have children so I want to count it but, if it’s inside another parent then it won’t have children and I want to ignore it).
But perhaps there is a better solution than counting children, e.g. a $xml->isParent() function or something?
The current function times out before it has fully counted all the sections of the XML (there are around 8 different sections and some of them have several hundred thousand records).
How can I make this process more efficient? I’m also using similar code to grab the contents of the main sections and put them into a database, so it will pay dividends to be as efficient as possible.
Also worth noting that I’m not particularly good at programming so please feel free to point out other mistakes I may have made so that I can improve.
4 Answers
** Solution ** Standing on the shoulders of giants (thanks to all who replied, especially @ThW), I used the DOMDocument solution. With some time logging I found that searching the document to get to the correct starting point was taking a lot of the time. So I looped around the 'while' to keep the pointer in the correct position. This has changed the transfer time from 4.5 hours down to a few minutes. When I 'break' from the while loop I return to an Ajax query that then updates the screen and re-runs until we have imported the whole XML.
It sounds like using XPath instead of iterating over the XML might work for your use case. With an XPath expression you can select the specific nodes you need:
https://3v4l.org/bLLEi#v8.2.3
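As a rough illustration of the XPath idea, here is a minimal sketch. It assumes the whole document fits in memory (which may not hold for a very large file) and uses a shortened copy of the sample XML from the question; the variable names are made up for this example. The absolute paths /LaunchBox/Game and /LaunchBox/Platform match only direct children of the document element, so the Platform nested inside Game is not counted.

```php
<?php
// Shortened sample data based on the question's XML.
$xmlString = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
  <Game><Name>Wishbringer</Name><Platform>ZiNc</Platform></Game>
  <Platform><Name>3DO Interactive Multiplayer</Name></Platform>
  <Platform><Name>Commodore Amiga</Name></Platform>
</LaunchBox>
XML;

// Load the whole document into DOM (assumption: it fits in memory).
$document = new DOMDocument();
$document->loadXML($xmlString);
$xpath = new DOMXPath($document);

// count() returns a float, so cast it to an integer.
$gameCount = (int) $xpath->evaluate('count(/LaunchBox/Game)');
$platformCount = (int) $xpath->evaluate('count(/LaunchBox/Platform)');

// Extract content from the top-level Platform elements only.
$platformNames = [];
foreach ($xpath->query('/LaunchBox/Platform/Name') as $nameNode) {
    $platformNames[] = $nameNode->textContent;
}
```

For a file with several hundred thousand records this approach is likely to use a lot of memory, so it is probably best suited to smaller files or to fragments.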
You do not need to serialize the XML to load it into DOM or SimpleXML. You can expand into a DOM document:
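A minimal sketch of that idea, assuming the sample structure from the question (shortened here) and hypothetical variable names: XMLReader::expand() is given a target DOMDocument so the returned node belongs to it, and the depth property is used to accept only direct children of the document element.

```php
<?php
// Shortened sample data based on the question's XML.
$xmlString = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
  <Platform><Name>3DO Interactive Multiplayer</Name></Platform>
  <Platform><Name>Commodore Amiga</Name></Platform>
</LaunchBox>
XML;

$reader = new XMLReader();
$reader->XML($xmlString);   // for a file, use $reader->open($xmlFile) instead
$document = new DOMDocument();

$platformNames = [];
while ($reader->read()) {
    // The document element is at depth 0, its direct children at depth 1,
    // so the Platform inside Game (depth 2) is skipped.
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->depth === 1
        && $reader->name === 'Platform') {
        // Expand only this element (and its descendants) into DOM.
        $node = $reader->expand($document);
        $nameNode = $node->getElementsByTagName('Name')->item(0);
        if ($nameNode !== null) {
            $platformNames[] = $nameNode->textContent;
        }
    }
}
$reader->close();
```

Only one record is held as a DOM node at a time, so memory use should stay roughly constant regardless of file size.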
However, counting the element children of the document element can be done with just the right calls to XMLReader::read() and XMLReader::next(). read() will navigate to the following node, including descendants, while next() goes to the following sibling node, ignoring the descendants.
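A minimal sketch of counting this way, assuming the sample structure from the question (shortened here); the $counts array and the loop layout are illustrative choices, not the original answer's code. read() is used to reach the document element and step into it, and next() then moves from sibling to sibling without visiting the nested Platform elements.

```php
<?php
// Shortened sample data based on the question's XML.
$xmlString = <<<'XML'
<?xml version="1.0" standalone="yes"?>
<LaunchBox>
  <Game><Name>Violet</Name><Platform>ZiNc</Platform></Game>
  <Game><Name>Wishbringer</Name><Platform>ZiNc</Platform></Game>
  <Platform><Name>3DO Interactive Multiplayer</Name></Platform>
  <Platform><Name>Commodore Amiga</Name></Platform>
</LaunchBox>
XML;

$reader = new XMLReader();
$reader->XML($xmlString);   // for a file, use $reader->open($xmlFile) instead

// Move to the document element (LaunchBox).
while ($reader->read() && $reader->nodeType !== XMLReader::ELEMENT) {
    continue;
}

$counts = [];
// Step into the first child node of the document element.
$hasNode = $reader->read();
// Stay on the children's level; depth 0 means we left the document element.
while ($hasNode && $reader->depth > 0) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $counts[$reader->name] = ($counts[$reader->name] ?? 0) + 1;
    }
    // Jump to the following sibling, skipping this node's descendants.
    $hasNode = $reader->next();
}
$reader->close();
```

Because no element is ever serialized or expanded, this should be the cheapest of the approaches when only the counts are needed.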
I’m not sure I’ve fully understood your requirement but if the output you are looking for is:
then you can achieve it with this streamable XSLT 3.0 stylesheet:
XSLT 3.0 is available via a PHP API in the SaxonC product (caveat, this is my company’s product).