I’m parsing very large XML files, so I need to use PHP’s XMLReader.
The files cannot be modified at the source, so they have to be parsed as they are.
The problem is that the documents contain character references starting with "&#" that the reader rejects as invalid.


        $reader = new XMLReader();
    
        if (!$reader->open($fileNamePath))//File xml
            {
            echo "Error opening file: $fileNamePath".PHP_EOL;
            continue;
            }
        echo "Processing file: $file".PHP_EOL;
       
           
        while($reader->read()) 
            {
            
            if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'AIUTO') 
                {
                
                try {
                    $input =$reader->readOuterXML();
                    $nodeAiuto = new SimpleXMLElement($input);
                    }
                catch(Exception $e)
                    {
                    echo "Error Node AIUTO ".$e->getMessage().PHP_EOL;
                    continue;
                    }
                //Do stuff here
                }
         }
    
         $reader->close();

I get a lot of messages like this:

PHP Warning: XMLReader::readOuterXml(): myfile.xml:162: parser error : xmlParseCharRef: invalid xmlChar value 2…
Errore Nodo AIUTO String could not be parsed as XML

Obviously the file contains the sequence "&#2;".

Here is some XML from the file that causes the error:

<AIUTO><BASE_GIURIDICA_NAZIONALE>Quadro riepilogativo delle misure a sostegno delle imprese attive nei settori agricolo, forestale, della pesca 
e acquacoltura ai sensi della Comunicazione della Commissione europea C (2020) 1863 final – “Quadro 
temporaneo per le misure di aiuto di Stato a sostegno dell’economia nell’attuale emergenza del COVID&#2;19” e successive modifiche e integrazioni</BASE_GIURIDICA_NAZIONALE></AIUTO>

I thought of parsing every file as text, line by line, and replacing the invalid sequences.

But that’s a little tricky.
Does anyone have a better solution?

2 Answers


  1. I’ve been there with an XML file, and the best workaround I found is to replace the offending string with nothing:

    $xml = str_replace('YOUR STRING', '', $xml);
    

    If you can’t delete the data from the XML, you can try to parse the XML and then loop over each element with:

    $xml = simplexml_load_file('file.xml');
    foreach ($xml as $object) {
        // your code...
    }
    
  2. What you can do is build a custom stream filter in which you apply all the fixes you need. This way you can keep reading the file as a stream with XMLReader, without loading the full content at once.

    class fix_entities_filter extends php_user_filter
    {
        function filter($in, $out, &$consumed, $closing): int
        {
            while ($bucket = stream_bucket_make_writeable($in)) {
                $bucket->data = $this->fix($bucket->data);
                $consumed += $bucket->datalen;
                stream_bucket_append($out, $bucket);
            }
            return PSFS_PASS_ON;
        }
        
        function fix($data)
        {
            return strtr($data, ['&#2;' => '&#x202f;']);
        }
    }
    
    stream_filter_register("fix_entities", "fix_entities_filter")
        or die("Failed to register filter");
    
    $file = 'file.xml';
    $fileNamePath = "/path/to/your/$file";
    $path = "php://filter/read=fix_entities/resource=$fileNamePath";
    
    $reader = new XMLReader();
        
    if (!$reader->open($path)) {
        echo "Error opening file: $fileNamePath", PHP_EOL;
    }
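
    The fix() method above targets only the "&#2;" sequence. As a minimal sketch of a more general replacement, the hypothetical helper strip_invalid_char_refs() below could be called from fix() instead of strtr(). It assumes that any numeric character reference outside the XML 1.0 character range should simply be dropped; the function name and the choice to delete rather than substitute are assumptions, not part of the original answer.

    ```php
    <?php
    // Hypothetical helper (not from the original answer): removes numeric
    // character references whose code point is not a legal XML 1.0 character.
    function strip_invalid_char_refs(string $xml): string
    {
        $result = preg_replace_callback(
            // Matches decimal (&#2;) and hexadecimal (&#x2;) references.
            '/&#(x[0-9a-fA-F]+|[0-9]+);/',
            function (array $m): string {
                $ref  = $m[1];
                $code = ($ref[0] === 'x') ? hexdec(substr($ref, 1)) : (int) $ref;

                // Legal ranges per the XML 1.0 "Char" production.
                $valid = $code === 0x9 || $code === 0xA || $code === 0xD
                    || ($code >= 0x20 && $code <= 0xD7FF)
                    || ($code >= 0xE000 && $code <= 0xFFFD)
                    || ($code >= 0x10000 && $code <= 0x10FFFF);

                // Keep valid references untouched, drop the invalid ones.
                return $valid ? $m[0] : '';
            },
            $xml
        );

        // preg_replace_callback() returns null on a regex error; fall back to the input.
        return $result ?? $xml;
    }
    ```

    Note that a stream filter receives the file in chunks, so a reference that happens to be split across two chunks would not be matched by this helper or by the strtr() call; handling that case would require carrying the tail of one chunk over to the next.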
    

    You can find more information about stream filters in the PHP manual and also in the book "Modern PHP – O’Reilly".
