I have a method which takes an HTML string and loops through each HTML tag, adding the text contents to an associative array, which is then passed through json_encode and written to a JSON file.
For some reason the JSON file I create contains weird characters, as you can see in the photo.
Storage::disk('public')->put($fileName, json_encode($newArray, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT));
My full method:
$htmlString = '<section>
<h2>CCPA Privacy Notice Addendum</h2>
<p>This California Consumer Privacy Act (CCPA) Privacy Notice Addendum supplements the information provided in the [App Name] Privacy Policy and applies solely to residents of the State of California ("consumers" or "you"). We adopt this addendum to comply with the CCPA and provide you with the required information about your rights under the CCPA.</p>
</section>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($htmlString);
$count = 0;
$keyPattern = 'ccpaRights';
$newArray = [];
foreach ($dom->getElementsByTagName('section') as $section)
{
    // loop through each child of <section>
    foreach ($section->childNodes as $childNode)
    {
        $nodeValue = $childNode->nodeValue;
        if ($nodeValue === '')
        {
            continue;
        }
        $count = $count + 1;
        $key = (string) $keyPattern.'Text'.$count;
        $newArray[$key] = $nodeValue;
    }
}
$fileName = '/temp/translated-'.rand(1,1000).'.json';
Storage::disk('public')->put($fileName, json_encode($newArray, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT));
2 Answers
DOMDocument's loadHTML() method will load the markup with the ISO-8859-1 character set by default if a character encoding is not explicitly stated. That said, the link I provided uses an out-of-date method to fix this. The functions used have been deprecated as of PHP 8.2 and may be removed in 8.3+.
The 8.3-compatible alternative used by most frameworks is
Adding that to your code should output a properly encoded JSON file.
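As a minimal sketch of this kind of fix, the example below assumes the alternative meant here is mb_encode_numericentity(), which converts every non-ASCII character to a numeric entity before the markup reaches loadHTML(). That choice, the sample markup, and the array key are assumptions for illustration, not something the answer itself confirms. It needs the mbstring and dom extensions.

```php
<?php
// Hypothetical sample markup containing non-ASCII characters (UTF-8 source file).
$htmlString = '<section><p>Privacy “notice” for café users</p></section>';

// Turn every character from U+0080 upwards into a numeric entity such as &#8220;
// so that the ISO-8859-1 default of loadHTML() no longer affects the result.
$encoded = mb_encode_numericentity($htmlString, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($encoded);

// The parser resolves the entities, so nodeValue comes back as UTF-8 text.
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;

echo json_encode(['ccpaRightsText1' => $text], JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT), PHP_EOL;
```

In the question's method, only the loadHTML() call would change: pass the entity-encoded string instead of $htmlString and leave the loop and the json_encode() call as they are.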
The loadHTML method doesn’t load the HTML string in a UTF-8 format. So, one simple way to overcome this is replacing this line:
With this:
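One way such a one-line replacement might look, as a sketch: this assumes the line being replaced is the question's $dom->loadHTML($htmlString); call and that the fix is to prepend an XML encoding declaration so libxml treats the input as UTF-8. Both are assumptions, and the sample markup is hypothetical.

```php
<?php
// Hypothetical sample markup containing non-ASCII characters (UTF-8 source file).
$htmlString = '<section><p>Privacy “notice” for café users</p></section>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();

// Before (the UTF-8 bytes are not read as UTF-8):
// $dom->loadHTML($htmlString);

// After: the prepended declaration hints to libxml that the input is UTF-8.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $htmlString);

$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;

echo json_encode(['ccpaRightsText1' => $text], JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT), PHP_EOL;
```

The declaration stays in the document as an extra node, which does not matter when only the text of the section elements is read, as in the question's loop. Behaviour may differ between libxml versions, so it is worth checking the output on the PHP build in use.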