
I’ve been at this for half a day, so now it’s time to ask for help.

What I’d like is for DOMDocument to leave existing entities and utf-8 characters alone. I’m now thinking this is not possible using only DOMDocument.

$html =
'<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';

Then I run:

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();

And get entity output:

input: &#39; &quot; & &lt; © 庭
output: ' " &amp; &lt; &copy; 庭

Why is DOMDocument converting &#39; and &quot; to actual quote marks? The only entity it didn’t touch was &lt;.

I’m pretty sure the copyright symbol is being converted because DOMDocument doesn’t think the input HTML is UTF-8, but I’m utterly confused about why it’s converting the quote entities back to literal characters.

I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn’t.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.

2 Answers


  1. Chosen as BEST ANSWER

    I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.

    require 'vendor/autoload.php';
    
    $dom = new IvoPetkov\HTML5DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR);
    
    echo $dom->saveHTML();
    

    Result:

    input: &#39; &quot; &lt; © 庭 &nbsp; &
    output: &#39; &quot; &lt; © 庭 &nbsp; &amp;
    

  2. You need to pass a specific node to the saveHTML() method. When it is given a node, it takes a minimal approach to encoding entities: it still encodes the ones that are necessary, but it won’t try to encode every character it can. I don’t think there’s a way to prevent all entity encoding from happening.

    $html = $dom->saveHTML($dom);
    // ' " &amp; &lt; © 庭
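
    To see why the quote entities cannot survive a round trip, it may help to look at what the parser stores. The sketch below is only an illustration: the ASCII-only `$html` string and the `$p`, `$text` and `$out` variable names are assumptions, not part of the question. It loads a paragraph, reads back its text, and then serializes just that node.

    ```php
    <?php
    // Illustrative input (ASCII only), not the asker's exact document.
    $html = '<!doctype html><html><body><p>&#39; &quot; &amp; &lt;</p></body></html>';

    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR);

    // The parser decodes every entity into a plain character, so the DOM
    // no longer records which spelling the source used.
    $p = $dom->getElementsByTagName('p')->item(0);
    $text = $p->textContent;

    // Serializing a specific node re-escapes only the characters that
    // the markup requires to stay well-formed.
    $out = $dom->saveHTML($p);

    echo $text, PHP_EOL;
    echo $out, PHP_EOL;
    ```

    Since the decoded text holds plain quote characters, the serializer has no record that they were originally written as entities, which would explain why only `&amp;` and `&lt;` come back escaped.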
    