I have terribly malformed HTML, thanks to MS Word 10's "Save as htm, html". Here’s a sample of what I’m trying to sanitize:
<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
<head>
<meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
</head>
<body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
<div class=WordSection1>
<h1>Pros and Cons of a Website</h1>
<p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p> </o:p></span></p>(this is a sample of what it uses as line breaks. Take note of the <o:p> tag).
<p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
</p>
</div>
<div class=WordSection2>...same pattern in div 1</div>
<div class=WordSection3>...same...</div>
</body>
</html>
What I need from all of this is:
<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
What I have so far:
$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
if($node->tagName=='script') $node->parentNode->removeChild($node);
if($node->tagName=='a') continue;
$attrs = $xpath->query('@*', $node);
foreach($attrs as $attr) {
$attr->parentNode->removeAttribute($attr->nodeName);
}
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));
It gives me:
<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
<div>
<h1>Pros and Cons of a Website</h1>
<p><p> </p></p>
<p>A SAMPLE TEXT</p>
</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
</body>
which I’m good with, but I want the body tag out. I also want the h1 and its content out too, but when I say:
if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);
something weird happens:
<p><p> </p></p> becomes <p class="MsoBodyText" ...all the very long attributes I was trying to remove in the first place><p> </p></p>
I’ve come across some very good answers, like:
- How to get innerHTML of DOMNode? (I don’t know how to properly implement Haim Evgi’s answer or Keyacom’s. Marco Marsala’s answer is the closest I got, but the divs all kept their classes.)
2 Answers
The removal of h1 shifts the list of $nodes, causing <p class="MsoBodyText"> to be skipped in the next iteration. To avoid this, replace foreach with a for loop and decrement the current index whenever a node is removed.
Then, the saveHTML() function can be invoked for each child node, resulting in a combined output that omits the parent body tag.
As an alternative, extract the text alone and recreate the wrapping tag.
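The for-loop approach described above could look roughly like the following sketch. The $html sample string is a shortened, made-up stand-in for the Word export, and the attribute-stripping loop is just one way to do it, not necessarily what the asker's final code should be.

```php
<?php
// Hypothetical, shortened stand-in for the Word-generated markup.
$html = '<html><body lang="EN-US"><div class="WordSection1">'
      . '<h1>Pros and Cons of a Website</h1>'
      . '<p class="MsoBodyText" style="line-height:115%">A<span style="letter-spacing:.05pt"> </span>SAMPLE TEXT</p>'
      . '</div></body></html>';

libxml_use_internal_errors(true); // silence warnings about non-standard tags
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$body  = $dom->getElementsByTagName('body')->item(0);
$nodes = $body->getElementsByTagName('*'); // live list: it shrinks when nodes are removed

for ($i = 0; $i < $nodes->length; $i++) {
    $node = $nodes->item($i);

    if ($node->tagName === 'script' || $node->tagName === 'h1') {
        $node->parentNode->removeChild($node);
        $i--; // the next element has moved into this index, so visit it again
        continue;
    }
    if ($node->tagName === 'a') {
        continue; // keep attributes on links
    }
    // The attribute map is live as well, so keep removing the first entry.
    while ($node->attributes->length > 0) {
        $node->removeAttribute($node->attributes->item(0)->nodeName);
    }
}

// Serialize each child of <body> so the <body> tag itself is left out.
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
$out = str_ireplace(['<span>', '</span>'], '', $out);

echo $out;
```

Removing a node also drops its descendants from the live list, which is why a single decrement is enough even when the removed element has children.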
If you’ve got a recent PHP version at hand (8+), you can create a fragment from all the child nodes of the body element and call saveXML() on it. This moves the child nodes into the fragment, so it is only useful for the inner-HTML problem and can only be applied once. The result therefore depends on where in your code you put it.
It may show, though, that it is often better to collect the elements you want to export by appending them to the fragment than to remove the unwanted ones from the original document.
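The answer's original snippet is not reproduced here; as a rough sketch of the fragment idea, assuming PHP 8.0+ (where DOMDocumentFragment::append() is available) and a made-up $html sample, it might look like this:

```php
<?php
// Hypothetical minimal input; the real document would be the cleaned-up Word export.
$html = '<html><body lang="EN-US"><div>one</div><div>two</div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

// Copy the live child list into a plain array first, so that moving the
// nodes does not disturb the iteration.
$children = iterator_to_array($body->childNodes);

$fragment = $dom->createDocumentFragment();
$fragment->append(...$children); // moves (does not copy) the nodes out of <body>

$inner = $dom->saveXML($fragment); // serializes only the fragment's children
echo $inner;

// <body> is now empty, which is why this can only be applied once.
$remaining = $body->childNodes->length;
```

To follow the closing remark instead, the same fragment can be filled selectively, appending only the elements that should be exported (for example the div elements) rather than everything under body.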