Why is DOMDocument converting both html quote-entities to actual quotes? - PHP

Jeff
April 20, 2023
266 views
0 votes
2 Answers

I’ve been at this for half a day, so now it’s time to ask for help.

What I’d like is for DOMDocument to leave existing entities and utf-8 characters alone. I’m now thinking this is not possible using only DOMDocument.

```
$html =
'<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';
```

Then I run:

```
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();
```

And get this output:

```
input: &#39; &quot; & &lt; © 庭
output: ' " &amp; &lt; &copy; 庭
```

Why is DOMDocument converting &#39; and &quot; to actual quote marks? The only entity it didn’t touch was &lt;.

Pretty sure the copyright symbol is being converted because DOMDocument doesn’t think the input html is utf-8, but I’m utterly confused why it’s converting the quotes back to non-entities.

I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn’t.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.

Tags: domdocument php

Answers

Chosen as BEST ANSWER
- Jeff
- April 20, 2023 at 5:20 pm
- 0 votes
I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.
```
require 'vendor/autoload.php';

$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();
```
Result:
```
input: &#39; &quot; &lt; © 庭 &nbsp; &
output: &#39; &quot; &lt; © 庭 &nbsp; &amp;
```
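For context, a minimal sketch of why stock DOMDocument cannot keep the quote entities: the parser resolves every entity reference into the character it stands for, so the DOM no longer knows how the character was spelled in the source. When the tree is written back out, only characters that must be escaped inside text (such as & and <) become entities again. The one-paragraph input below is a hypothetical reduction of the markup in the question.

```php
<?php
// Hypothetical minimal input: four entity references in one paragraph.
$dom = new DOMDocument();
$dom->loadHTML('<p>&#39; &quot; &amp; &lt;</p>', LIBXML_NOERROR);

$p = $dom->getElementsByTagName('p')->item(0);

// The text node stores decoded characters; the original spelling is gone.
$text = $p->textContent;
var_dump($text === "' \" & <"); // bool(true)

// Escaping only what text content requires (quotes are left alone)
// reproduces the form seen in the question's output.
echo htmlspecialchars($text, ENT_NOQUOTES), "\n"; // ' " &amp; &lt;
```

The htmlspecialchars() call is only a stand-in to illustrate which characters need escaping in text; it is not what DOMDocument calls internally.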


- Jim
- April 20, 2023 at 12:38 am
- 0 votes
You need to pass a specific node to the saveHTML() method. With a node argument it takes a minimal approach to encoding entities: it still encodes the ones that are necessary, but it won’t try to encode every character it can. I don’t think there’s a way to prevent all entity encoding from happening.
```
$html = $dom->saveHTML($dom);
// ' " &amp; &lt; © 庭
```
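A fuller sketch of the same idea, from load to save, in case it helps. Two things here are assumptions and not part of the answer above: the input is a hypothetical one-paragraph fragment, and mb_encode_numericentity() (from the mbstring extension) is used to turn non-ASCII characters into numeric references before parsing, so the result should not depend on how the parser guesses the input encoding. The script file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical fragment; the paragraph mirrors the one in the question.
$html = '<p>&#39; &quot; & &lt; © 庭</p>';

// Assumption: replace every character above ASCII with a numeric
// reference so the parser only ever sees ASCII bytes.
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($ascii, LIBXML_NOERROR);

// Passing a node to saveHTML() selects the minimal escaping.
$p = $dom->getElementsByTagName('p')->item(0);
$out = $dom->saveHTML($p);

echo $out, "\n";
```

The quote entities still come back as plain quote characters, because that information is lost at parse time, but © and 庭 should be written as UTF-8 characters instead of entities.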