I have a string of HTML containing text within some divs. I need to extract that text from the divs. (For curiosity's sake, these extra divs appeared when a user copy/pasted into a contenteditable div.)
Starting HTML:
<div>
  Text1
  <div>
    <p>para</p>
    Text2
  </div>
  <div>
    Text3
  </div>
</div>
The HTML that I want would be:
<div>
  Text1
  <p>para</p>
  Text2
  Text3
</div>
My plan was to use xpath to find all internal divs and "promote" their contents into the doc.
$doc = new DOMDocument();
$doc->loadHTML("<div>Text1<div><p>para</p>Text2</div><div>Text3</div></div>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->evaluate('/div/div[not(@*)]') as $node) {
    $frag = $doc->createDocumentFragment();
    foreach ($node->childNodes as $child) {
        $frag->appendChild($child);
    }
    $node->replaceWith($frag);
}
This sort-of works, but it gets confused with divs containing text as well as other html. The result is:
<div>Text1<p>para</p>Text3</div>
Why is the Text2 text node missing?
2 Answers
This is because your HTML is not valid: a <div> tag should not wrap a <p> tag. Paragraphs are standalone blocks: the <p> tag represents a block of text or a paragraph. Wrapping a paragraph with a <div> often introduces unnecessary markup without adding meaning.
$frag->appendChild($child); removes the child from its parent, and that in turn makes the foreach loop over $node->childNodes skip an element. The loop removes the first child node, and after that it proceeds with the second element in the child node list. But that is no longer the original second child: because the first child was removed, all the other children have moved up by one position.
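The effect can be reproduced in isolation. This is only a minimal sketch, assuming PHP 8 with the DOM extension; the names $moved and $left are illustrative and not from the question. With the two children from the question's first inner div, only the <p> should reach the fragment, while Text2 stays behind in the div:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div><p>para</p>Text2</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$div  = $doc->documentElement;           // the <div> holding <p> and Text2
$frag = $doc->createDocumentFragment();

foreach ($div->childNodes as $child) {
    // appendChild() moves $child out of $div while the loop is still
    // walking $div->childNodes, which is a live list.
    $frag->appendChild($child);
}

$moved = $frag->childNodes->length;  // children that reached the fragment
$left  = $div->childNodes->length;   // children still inside the <div>
echo "moved: $moved, left behind: $left\n";
```

When the div is later replaced by the fragment, whatever was left behind is discarded along with it, which matches the missing Text2 in the question.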
There are several ways to avoid this. You could, for example, process the child nodes in reverse order (and then insert each one at the beginning of the document fragment, so the original order is preserved).
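The reverse-order idea could look roughly like this. It is a sketch rather than a definitive implementation: it reuses the input string and XPath query from the question, assumes PHP 8 (for replaceWith()), and $result is just an illustrative variable name.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<div>Text1<div><p>para</p>Text2</div><div>Text3</div></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
$xpath = new DOMXPath($doc);

foreach ($xpath->evaluate('/div/div[not(@*)]') as $node) {
    $frag = $doc->createDocumentFragment();
    // Walk the children from last to first, so removing one never
    // changes the index of a child that is still to be processed.
    for ($i = $node->childNodes->length - 1; $i >= 0; $i--) {
        // Insert at the front of the fragment to keep document order.
        // A null reference node (empty fragment) simply appends.
        $frag->insertBefore($node->childNodes->item($i), $frag->firstChild);
    }
    $node->replaceWith($frag);
}

$result = trim($doc->saveHTML());
echo $result, "\n";
```

The outer loop is safe as it stands, because the node list returned by the XPath query is not live; only childNodes is.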
Another approach is to not use a document fragment to begin with, but simply insert each child before its parent for as long as there are children, and then remove the original node afterwards.
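A minimal sketch of that fragment-free approach, again reusing the input string and XPath query from the question; $parent and $result are illustrative names, not taken from the original post:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<div>Text1<div><p>para</p>Text2</div><div>Text3</div></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
$xpath = new DOMXPath($doc);

foreach ($xpath->evaluate('/div/div[not(@*)]') as $node) {
    $parent = $node->parentNode;
    // Always take the current first child, so nothing depends on a
    // position inside the live childNodes list.
    while ($node->firstChild !== null) {
        $parent->insertBefore($node->firstChild, $node);
    }
    // The inner div is empty now and can be dropped.
    $parent->removeChild($node);
}

$result = trim($doc->saveHTML());
echo $result, "\n";
```

Because the loop condition is re-evaluated against the node itself on every pass, it keeps going until the inner div has no children left, regardless of how many there were.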
(Note that with the given example input, you will end up with Text2Text3 in the result. For them to still be separated, there would need to be whitespace between the inner div elements. In your original HTML that is given by the line breaks.)