I’ve been at this for half a day, so now it’s time to ask for help.
What I’d like is for DOMDocument to leave existing entities and utf-8 characters alone. I’m now thinking this is not possible using only DOMDocument.
$html =
'<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<p>' " & < © 庭</p>
</body>
</html>';
Then I run:
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);
echo $dom->saveHTML();
And get entity output:
input: ' " & < © 庭
output: ' " & < © 庭
Why is DOMDocument converting the ' and " entities to actual quote marks? The only entity it didn’t touch was the one for <.
Pretty sure the copyright symbol is being converted because DOMDocument doesn’t think the input html is utf-8, but I’m utterly confused why it’s converting the quotes back to non-entities.
I thought the mb_convert_encoding trick would fix the utf-8 issue, but it hasn’t. Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.
2 Answers
I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.
Result:
You need to pass a specific element to the saveHTML() method. That makes it take a minimalist approach to encoding entities. It will still encode the ones that are necessary. I don’t think there’s a way to prevent all entity encoding, but it won’t try to encode every entity it can.
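A minimal sketch of that suggestion, assuming the same kind of document as in the question. The sample markup and the variable names ($whole, $fromRoot, $output) are made up for illustration, and the exact entity output can differ between PHP and libxml2 versions, so compare the two results on your own setup rather than trusting any fixed expectation.

```php
<?php
// Illustrative input: two entities that must stay escaped (&amp; and &lt;),
// one entity that is only optional in text (&quot;), and raw UTF-8 characters.
$html = '<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<p>&quot; &amp; &lt; © 庭</p>
</body>
</html>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

// No argument: the whole document is serialized, and non-ASCII
// characters may come back as entities.
$whole = $dom->saveHTML();

// Node argument: only that element and its children are serialized.
$fromRoot = $dom->saveHTML($dom->documentElement);

// The doctype is not part of documentElement, so add it back by hand.
$output = "<!doctype html>\n" . $fromRoot;

echo $whole, "\n---\n", $output, "\n";
```

Characters such as & and < are still escaped in both outputs because unescaped they would change the markup. The difference to look for is in how the quote, © and 庭 are written.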