I’m parsing very large XML files, so I need to use PHP’s XMLReader.
The files cannot be modified at the source, so they have to be parsed as they are.
The problem is that the documents contain HTML character references ("&#...") that the reader rejects as invalid.
$reader = new XMLReader();
if (!$reader->open($fileNamePath)) // XML file
{
    echo "Error opening file: $fileNamePath".PHP_EOL;
    continue;
}
echo "Processing file: $file".PHP_EOL;
while ($reader->read())
{
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'AIUTO')
    {
        try {
            $input = $reader->readOuterXML();
            $nodeAiuto = new SimpleXMLElement($input);
        }
        catch (Exception $e)
        {
            echo "Error Node AIUTO ".$e->getMessage().PHP_EOL;
            continue;
        }
        // Do stuff here
    }
}
$reader->close();
I get a lot of messages like this:
PHP Warning: XMLReader::readOuterXml(): myfile.xml:162: parser error : xmlParseCharRef: invalid xmlChar value 2…
Error Node AIUTO String could not be parsed as XML
Obviously the file contains an invalid numeric character reference (a sequence starting with "&#").
Here is some XML from the file that causes the error:
<AIUTO><BASE_GIURIDICA_NAZIONALE>Quadro riepilogativo delle misure a sostegno delle imprese attive nei settori agricolo, forestale, della pesca
e acquacoltura ai sensi della Comunicazione della Commissione europea C (2020) 1863 final – “Quadro
temporaneo per le misure di aiuto di Stato a sostegno dell’economia nell’attuale emergenza del COVID19” e successive modifiche e integrazioni</BASE_GIURIDICA_NAZIONALE></AIUTO>
I thought of parsing every file as text, line by line, and replacing the invalid sequences,
but that’s a little tricky.
Does anyone have a better solution?
2 Answers
I’ve been there with an XML file, and I found that the best workaround is to replace the offending string with nothing.
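As a minimal sketch of that idea (not necessarily the exact code meant above), the helper below removes numeric character references whose code point XML 1.0 does not allow and leaves valid ones untouched. The function name stripInvalidCharRefs is hypothetical, and the sketch assumes the text is available as a string in memory, so for very large files it would have to be applied piece by piece rather than to the whole file.

```php
<?php
// Hypothetical helper (name chosen for illustration): deletes numeric character
// references such as "&#2;" that point at a code point forbidden by XML 1.0.
function stripInvalidCharRefs(string $xml): string
{
    $result = preg_replace_callback(
        '/&#(x[0-9a-fA-F]+|[0-9]+);/',
        static function (array $m): string {
            // Decode the referenced code point (hexadecimal or decimal form).
            $code = ($m[1][0] === 'x') ? hexdec(substr($m[1], 1)) : (int) $m[1];

            // Character ranges allowed by the XML 1.0 "Char" production.
            $valid = $code === 0x9 || $code === 0xA || $code === 0xD
                || ($code >= 0x20 && $code <= 0xD7FF)
                || ($code >= 0xE000 && $code <= 0xFFFD)
                || ($code >= 0x10000 && $code <= 0x10FFFF);

            // Keep valid references as they are, replace invalid ones with nothing.
            return $valid ? $m[0] : '';
        },
        $xml
    );

    // preg_replace_callback() returns null on a regex engine error; fall back to the input.
    return $result ?? $xml;
}
```

Named entities such as "&amp;" are not touched, because the pattern only matches references that start with "&#".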
If you can’t delete the data in the XML, you can try to parse the XML and then loop over each node.
What you can do is build a custom stream filter that applies all the fixes you need. This way you can keep reading the file as a stream with XMLReader, without loading the full content at once.
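A rough sketch of such a filter could look like the following. The class name InvalidCharRefFilter and the filter id "xml.strip_invalid_refs" are made up for illustration; only php_user_filter, the stream_bucket_* functions and stream_filter_register() are PHP's own API. The sketch assumes the only fix needed is dropping numeric character references that XML 1.0 forbids, and it holds back an unfinished "&#..." at the end of a chunk so that a reference split across two chunks is still caught.

```php
<?php
// Sketch only: class name and filter id are illustrative assumptions.
class InvalidCharRefFilter extends php_user_filter
{
    /** Unfinished "&#..." tail of the previous chunk, if any. */
    private $carry = '';

    public function filter($in, $out, &$consumed, $closing): int
    {
        $wrote = false;
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data = $this->carry . $bucket->data;
            $this->carry = '';

            // A reference may be split across two chunks: hold back an open tail.
            if (preg_match('/&(?:#(?:x[0-9a-fA-F]*|[0-9]*))?\z/', $data, $m)) {
                $this->carry = $m[0];
                $data = substr($data, 0, strlen($data) - strlen($m[0]));
            }

            $data = self::clean($data);
            if ($data !== '') {
                $bucket->data = $data;
                stream_bucket_append($out, $bucket);
                $wrote = true;
            }
        }

        if ($closing && $this->carry !== '') {
            // End of input: emit whatever is still held back, unchanged.
            stream_bucket_append($out, stream_bucket_new($this->stream, $this->carry));
            $this->carry = '';
            $wrote = true;
        }

        return $wrote ? PSFS_PASS_ON : PSFS_FEED_ME;
    }

    /** Removes numeric character references outside the XML 1.0 character ranges. */
    private static function clean(string $xml): string
    {
        $result = preg_replace_callback(
            '/&#(x[0-9a-fA-F]+|[0-9]+);/',
            static function (array $m): string {
                $code = ($m[1][0] === 'x') ? hexdec(substr($m[1], 1)) : (int) $m[1];
                $valid = $code === 0x9 || $code === 0xA || $code === 0xD
                    || ($code >= 0x20 && $code <= 0xD7FF)
                    || ($code >= 0xE000 && $code <= 0xFFFD)
                    || ($code >= 0x10000 && $code <= 0x10FFFF);
                return $valid ? $m[0] : '';
            },
            $xml
        );
        return $result ?? $xml;
    }
}

stream_filter_register('xml.strip_invalid_refs', InvalidCharRefFilter::class);
```

With the filter registered, the reader from the question should be able to open the file through the php://filter wrapper, for example $reader->open('php://filter/read=xml.strip_invalid_refs/resource=' . $fileNamePath), so the rest of the XMLReader loop stays as it is.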
You can find more information about stream filters in the PHP manual and also in the book "Modern PHP – O’Reilly".