I have a badly formed HTML file, thanks to MS Word 10's "save as htm, html" option. Here’s a sample of what I’m trying to sanitize.

<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
    <head>
        <meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
    </head>
    <body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
        <div class=WordSection1>
            <h1>Pros and Cons of a Website</h1>
            <p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p>&nbsp;</o:p></span></p>(this is a sample of what it uses as line breaks. Take note of the <o:p> tag).
            <p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
                A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
            </p>
        </div>
        <div class=WordSection2>...same pattern in div 1</div>
        <div class=WordSection3>...same...</div>
   </body>
</html>

What I need from all of this is:

<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>

What I have so far:

$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
    if($node->tagName=='script') $node->parentNode->removeChild($node);
    if($node->tagName=='a') continue;
    $attrs = $xpath->query('@*', $node);
    foreach($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));

It gives me:

<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
    <div>
        <h1>Pros and Cons of a Website</h1>
        <p><p> </p></p>
        <p>A SAMPLE TEXT</p>
    </div>
    <div>...same pattern in div 1</div>
    <div>...same...</div>
</body>

which I’m good with, but I want the body tag removed. I also want the h1 and its content removed, but when I write:

if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);

something weird happens:

<p><p> </p></p> becomes <p class="MsoBodyText" ...all the very long attributes I was trying to remove in the first place><p> </p></p>

I’ve come across some very good answers like:

  1. How to get innerHTML of DOMNode? I don’t know how to properly implement Haim Evgi’s answer or Keyacom’s answer. Marco Marsala’s answer is the closest I got, but the divs all kept their classes.

2 Answers


  1. getElementsByTagName() returns a live list, so removing h1 shifts the remaining items of $nodes down by one and <p class="MsoBodyText"> is skipped in the next iteration. To avoid this, replace foreach with a for loop and decrement the current index whenever a node is removed.

    $dom = new DOMDocument;
    @$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    
    $bodyNode = $xpath->query('//html/body')->item(0);
    $nodes = $bodyNode->getElementsByTagName('*');
    
    for ($i = 0; $i < $nodes->count(); $i++) {
        $node = $nodes->item($i);
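        if ($node->tagName == 'script' || $node->tagName == 'h1') {
            $node->parentNode->removeChild($node);
            $i--; // the live list shrank, so re-check the same index
            continue; // nothing more to do for the detached node
        }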
        if ($node->tagName == 'a') {
            continue;
        }
        $attrs = $xpath->query('@*', $node);
        foreach ($attrs as $attr) {
            $attr->parentNode->removeAttribute($attr->nodeName);
        }
    }
    echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($bodyNode)) . PHP_EOL;
    
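A possible alternative to adjusting the index by hand is to copy the live list into a plain array before looping, so that removals cannot shift the items being visited. The sketch below assumes that approach; `stripWordMarkup()` and the sample markup are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): removes unwanted
// elements and strips attributes while looping over a snapshot of the list.
function stripWordMarkup(DOMElement $root, array $dropTags = ['script', 'h1']): void
{
    // iterator_to_array() copies the live DOMNodeList into a plain array,
    // so removing nodes no longer changes what the loop visits.
    $snapshot = iterator_to_array($root->getElementsByTagName('*'), false);

    foreach ($snapshot as $node) {
        if (in_array($node->tagName, $dropTags, true)) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
            continue;
        }
        if ($node->tagName === 'a') {
            continue; // keep attributes on links, as in the question
        }
        // Snapshot the attribute map too, because it is also live.
        foreach (iterator_to_array($node->attributes, false) as $attr) {
            $node->removeAttributeNode($attr);
        }
    }
}

// Made-up sample input that imitates the Word export from the question.
$html = '<html><body lang=EN-US><div class=WordSection1>'
    . '<h1>Pros and Cons of a Website</h1>'
    . '<p class=MsoBodyText style="line-height:115%">A <span style="letter-spacing:.05pt">SAMPLE</span> TEXT '
    . '<a href="https://example.com">link</a></p>'
    . '</div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);

$body = $dom->getElementsByTagName('body')->item(0);
stripWordMarkup($body);

$result = $dom->saveHTML($body);
echo $result . PHP_EOL;
```

Note that the helper only visits descendants of the element it is given, so the body element itself keeps its attributes here, just as in the loop above.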

    Then, the saveHTML() function can be invoked for each child node, resulting in a combined output that omits the parent body tag.

    $inner = [];
    foreach ($bodyNode->childNodes as $node) {
        $inner []= trim($bodyNode->ownerDocument->saveHTML($node));
    }
    echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;
    

    As an alternative, extract the text alone and recreate the wrapping tag.

    $inner = [];
    foreach ($bodyNode->childNodes as $node) {
        $text = trim($node->textContent);
        if ($node->nodeType != XML_ELEMENT_NODE) {
            $inner []= $text;
            continue;
        }
        $inner []= sprintf('<%s>%s</%s>',
            $node->tagName, $text, $node->tagName);
    }
    echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;
    
  2. If you have PHP 8 or later, you can move all of the body's child nodes into a document fragment and call saveXML() on it:

    $element = $body->item(0); # the body element itself from xpath result
    
    $fragment = $dom->createDocumentFragment(); 
    $fragment->append(...$element->childNodes);
    
    echo str_ireplace(['<span>', '</span>'], '', $dom->saveXML($fragment));
    

    This moves the child nodes out of the body and into the fragment, so it only solves the inner-HTML problem and can only be applied once. Where you place it in your code therefore matters.

    It may also show that it is often better to collect the elements you want to export by appending them to a fragment than to remove the unwanted ones from the original document.
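The collect-instead-of-remove idea could look roughly like the sketch below. It assumes the wanted content is the paragraph text of each top-level div under body, as in the question's desired output; `collectSections()` and the sample markup are hypothetical names and data chosen for illustration.

```php
<?php
// Hypothetical helper: builds fresh <div> elements holding only the paragraph
// text of each section, instead of pruning the original tree.
function collectSections(DOMDocument $dom): DOMDocumentFragment
{
    $xpath = new DOMXPath($dom);
    $fragment = $dom->createDocumentFragment();

    foreach ($xpath->query('//body/div') as $section) {
        $parts = [];
        // Read paragraphs only, so headings such as <h1> are left out.
        foreach ($xpath->query('.//p', $section) as $p) {
            // Collapse whitespace runs, including the non-breaking space from &nbsp;.
            $text = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $p->textContent));
            if ($text !== '') {
                $parts[] = $text;
            }
        }
        // The new element carries no attributes, so nothing has to be stripped.
        $div = $dom->createElement('div');
        $div->appendChild($dom->createTextNode(implode(' ', $parts)));
        $fragment->appendChild($div);
    }

    return $fragment;
}

// Made-up sample input that imitates the Word export from the question.
$html = '<html><body lang=EN-US>'
    . '<div class=WordSection1><h1>Pros and Cons of a Website</h1>'
    . '<p class=MsoBodyText>A <span style="letter-spacing:.05pt">SAMPLE</span> TEXT</p></div>'
    . '<div class=WordSection2><p class=MsoBodyText>second section</p></div>'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);

$output = $dom->saveXML(collectSections($dom));
echo $output . PHP_EOL;
```

Because the original document is only read, this can be run more than once, and no str_ireplace() pass for span tags is needed. The trade-off is that any inline markup inside the paragraphs, such as links, is flattened to plain text.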
