
I have an XML document that is encoded as UTF-16 LE. Because of that, it cannot be read by WP All Import.

When I look at the XML declaration, I see this:

<?xml version="1.0" encoding="Unicode" ?>
And at the bottom of the Visual Studio Code window I see:
UTF-16 LE

I already changed the encoding manually in Visual Studio Code, but since there will be a new file every time (in the same format), it would be great if PHP could convert it to UTF-8.

<?xml version="1.0" encoding="Unicode" ?>
<root>
  <docs>

Is it possible to change the encoding of this file using PHP?

2 Answers


  1. Here is a generic XSLT stylesheet that copies your entire input XML as-is, but writes it with the encoding specified in xsl:output. All that is left is to run the XSLT transformation in PHP.

    XSLT

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes" encoding="utf-8"/>
    
        <xsl:template match="node()|@*">
            <xsl:copy>
                <xsl:apply-templates select="node()|@*"/>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
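
    Running the transformation from PHP could look roughly like the sketch below. It assumes the xsl and mbstring extensions are enabled, inlines the stylesheet so the example is self-contained, and uses a small stand-in document whose declaration says UTF-16 rather than Unicode (whether the parser accepts the Unicode label is not something this sketch verifies).

    ```php
    <?php
    // Identity stylesheet from above, inlined as a string.
    $xslt = '<?xml version="1.0"?>'
        . '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
        . '<xsl:output method="xml" indent="yes" encoding="utf-8"/>'
        . '<xsl:template match="node()|@*">'
        . '<xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>'
        . '</xsl:template>'
        . '</xsl:stylesheet>';

    // Stand-in for the real file: UTF-16 LE bytes preceded by a BOM.
    // Assumption: the declaration names UTF-16, not "Unicode".
    $input = "\xFF\xFE" . mb_convert_encoding(
        '<?xml version="1.0" encoding="UTF-16"?><root><docs>' . "\u{C4}\u{D6}\u{DC}" . '</docs></root>',
        'UTF-16LE',
        'UTF-8'
    );

    $source = new DOMDocument();
    $source->loadXML($input);

    $stylesheet = new DOMDocument();
    $stylesheet->loadXML($xslt);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($stylesheet);

    // transformToXml() serializes the result using the xsl:output settings.
    $utf8Xml = $processor->transformToXml($source);
    echo $utf8Xml;
    ```

    For a file on disk, `$source->load($path)` would replace the `loadXML()` call, where `$path` is whatever location the feed is stored at.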
    
  2. DOMDocument::loadXML() reads the encoding attribute from the XML declaration. However, Unicode is not a valid encoding name as far as I know; I would expect UTF-16LE. The DOM API in PHP works with UTF-8 internally, so it decodes the input to UTF-8 (based on the declared encoding) and encodes the output based on the encoding of the target document. You can simply change that encoding after loading.
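
    If the parser does reject the Unicode label, one workaround is to convert the raw bytes and rewrite the declaration before handing the file to anything else. This is a minimal sketch, assuming the mbstring extension is available and that the input really is UTF-16 LE (nothing is detected); `utf16leXmlToUtf8` is a hypothetical helper name.

    ```php
    <?php
    // Hypothetical helper: convert a UTF-16 LE XML string to UTF-8.
    function utf16leXmlToUtf8(string $bytes): string
    {
        // Drop the UTF-16 LE byte order mark (FF FE) if present.
        if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
            $bytes = substr($bytes, 2);
        }

        // Re-encode the payload; the source encoding is assumed, not detected.
        $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');

        // Make the XML declaration name the new encoding.
        $fixed = preg_replace(
            '/^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2/',
            '${1}${2}UTF-8${2}',
            $utf8,
            1
        );

        // preg_replace() returns null on error; fall back to the converted text.
        return $fixed ?? $utf8;
    }
    ```

    With file I/O around it, this would be something like `file_put_contents($target, utf16leXmlToUtf8(file_get_contents($source)))`, where both paths are placeholders.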

    Here is a demo of changing the encoding after loading:

    $xml = <<<'XML'
    <?xml version="1.0" encoding="utf-8"?>
    <foo>ÄÖÜ</foo>
    XML;
    
    $document = new DOMDocument();
    $document->loadXML($xml);
    
    $encodings = ['ASCII', 'UTF-16', 'UTF-16LE', 'UTF-16BE'];
    
    foreach ($encodings as $encoding) {
        // set required encoding
        $document->encoding = $encoding;
        // save
        echo $encoding."\n".$document->saveXML()."\n";
    }
    

    Output:

    ASCII
    <?xml version="1.0" encoding="ASCII"?>
    <foo>&#196;&#214;&#220;</foo>
    
    UTF-16
    ��<?xml version="1.0" encoding="UTF-16"?>
    <foo>���</foo>
    
    UTF-16LE
    <?xml version="1.0" encoding="UTF-16LE"?>
    <foo>���</foo>
    
    UTF-16BE
    <?xml version="1.0" encoding="UTF-16BE"?>
    <foo>���</foo>
    

    The generated string changes with the defined encoding.

    I started with a UTF-8 document here because SO is UTF-8 itself, so you can see the non-ASCII characters that way. ASCII triggers entity encoding for non-ASCII characters. UTF-16 adds a BOM to indicate the byte order. SO cannot display the UTF-16 encoded characters, so you get the � symbol. UTF-16LE and UTF-16BE define the byte order in the encoding name, so no BOM is needed.

    Of course, it works the same way in the other direction.
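
    The reverse direction, UTF-16 in and UTF-8 out, can be sketched as follows. The input is a stand-in built with mb_convert_encoding (so mbstring is assumed), with a BOM and a UTF-16 declaration; a real file would be read with load() instead of loadXML().

    ```php
    <?php
    // Stand-in input: a UTF-16 document with a little-endian byte order mark.
    $utf16 = "\xFF\xFE" . mb_convert_encoding(
        '<?xml version="1.0" encoding="UTF-16"?><foo>' . "\u{C4}\u{D6}\u{DC}" . '</foo>',
        'UTF-16LE',
        'UTF-8'
    );

    $document = new DOMDocument();
    $document->loadXML($utf16);

    // Only the serialization changes; the tree itself is already UTF-8 internally.
    $document->encoding = 'UTF-8';
    $utf8 = $document->saveXML();

    echo $utf8;
    ```

    The saved string carries a UTF-8 declaration and the characters as plain UTF-8 bytes, so it can be written out with file_put_contents() or `$document->save()`.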
