I need to improve some open source code. It contains a function to extract all <a> and <img> tags with a specific class from a string that represents HTML. This function uses regular expressions:
preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);
// Build the union set from $matches_img and $matches_a
This works mostly, but not always. Specifically, the regular expressions can match multiple tags in a single match:
$text = '<a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a>';
// matches whole string in a single match
preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);
My first approach
I tried to make the regular expression more specific:
// old
<a(.*)class="(.*)foo(.*)">(.*)</a>
// new
<a([^<>]*)class="(.*)foo(.*)">([^<>]*)</a>
But this, too, can match substrings that contain multiple tags:
$text = '<img class="bar"><img class="foo" src="baz">';
// matches whole string in a single match
preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
Is there a robust way to improve on this? I could replace the second and third groups ((.*)) with [^<>]* as well, but then I’d run into trouble as soon as an image’s alt or title attribute contains an angle bracket.
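One way to stay with regular expressions while tolerating angle brackets inside attribute values is to match each quoted value as a single unit, and then inspect the attributes of every candidate tag separately. The following is only a minimal sketch under stated assumptions: the function name extractElementsWithClass is hypothetical, it assumes the class must match as a whole token (so "foo" does not match "foobar"), it assumes <a> elements are never nested, and it does not try to skip comments, <script> blocks, or CDATA. It returns the substrings exactly as they appear in the input.

```php
<?php
// Hypothetical helper (not part of the original code base): returns <img> tags
// and <a>...</a> elements whose class attribute contains $className as a
// whitespace-separated token. The returned strings are copied verbatim from $html.
function extractElementsWithClass(string $html, string $className): array
{
    // An opening tag is "<a" or "<img", followed by any mix of plain characters
    // and complete quoted values. Quoted values may contain "<" and ">".
    $tagPattern = '#<(a|img)(?=[\s/>])(?:[^<>"\']++|"[^"]*+"|\'[^\']*+\')*+>#i';

    // One attribute: a name, optionally followed by a double-quoted,
    // single-quoted, or unquoted value.
    $attrPattern = '#([^\s"\'<>/=]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'<>]+)))?#';

    preg_match_all($tagPattern, $html, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $result = [];
    foreach ($tags as $tag) {
        [$openTag, $offset] = $tag[0];
        $name = strtolower($tag[1][0]);

        // Walk the attributes in order, so that text inside another
        // attribute's quoted value is never mistaken for a class attribute.
        $hasClass = false;
        preg_match_all($attrPattern, $openTag, $attrs, PREG_SET_ORDER);
        foreach ($attrs as $attr) {
            if (strtolower($attr[1]) !== 'class') {
                continue;
            }
            // Only one of the three value groups can be non-empty.
            $value = ($attr[2] ?? '') . ($attr[3] ?? '') . ($attr[4] ?? '');
            $tokens = preg_split('/\s+/', trim($value));
            $hasClass = in_array($className, $tokens, true);
            break;
        }
        if (!$hasClass) {
            continue;
        }

        if ($name === 'img') {
            // <img> has no closing tag, so the opening tag is the whole element.
            $result[] = $openTag;
            continue;
        }

        // Assumption: anchors are not nested, so the next literal "</a>"
        // closes this one. A closing tag written as "</a >" is not found.
        $end = stripos($html, '</a>', $offset + strlen($openTag));
        if ($end !== false) {
            $result[] = substr($html, $offset, $end + 4 - $offset);
        }
    }
    return $result;
}
```

Called with the two example strings above and the class name foo, this sketch should return only the second <a> element and only the second <img> tag, respectively, rather than one match spanning both tags.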
My second approach: DOMDocument
I tried to parse the HTML with this function:
function getElementsByClassName($html, $className, $tagName = null) {
    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);
    // Fall back to every element when no tag name is given.
    $elements = $dom->getElementsByTagName($tagName ?: '*');
    $matched = [];
    foreach ($elements as $element) {
        if (!$element->hasAttribute('class')) {
            continue;
        }
        $classes = $element->getAttribute('class');
        // Note: this is a substring test, so "foo" also matches "foobar".
        if (str_contains($classes, $className)) {
            // Serialize the node back to HTML; the result is re-generated
            // markup, not the original substring of $html.
            $matched[] = $dom->saveHTML($element);
        }
    }
    return $matched;
}
The problem here is that the matches I get do not correspond exactly to the input. There seems to be some encoding problem, but more importantly, DOMDocument was written to parse HTML 4. The function does give me all the tags I need to extract, but there are some problems with special characters and with syntax differences between HTML 4 and HTML 5. I need to get the tags exactly as they are contained in the input string.
Is there a robust solution to achieve this?
2 Answers
One possibility is to use DOMDocument/DOMXPath, which I think should be pretty robust given the constraints in your question.
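The code that accompanied this answer is not included in this copy, so the following is only a sketch of what a DOMDocument/DOMXPath approach might look like, not the answerer's original code. The function name getElementsByClassToken is hypothetical. It assumes the class should match as a whole token, that the input is UTF-8, and that $className contains no quote characters (it is interpolated into the XPath expression as-is).

```php
<?php
// Hypothetical helper: finds elements with the given tag names whose class
// attribute contains $className as a whitespace-separated token.
function getElementsByClassToken(string $html, string $className, array $tagNames = ['a', 'img']): array
{
    // Collect libxml warnings (for example about unknown HTML5 tags)
    // internally instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Common workaround: the leading declaration hints libxml to read the
    // input as UTF-8 instead of its default encoding.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Builds e.g. "self::a or self::img".
    $nameTests = implode(' or ', array_map(fn($name) => "self::$name", $tagNames));

    // Padding the normalized class list with spaces lets contains() test for
    // a whole token: " foo " is found in " foo big " but not in " foobar ".
    $query = sprintf(
        "//*[%s][contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $nameTests,
        $className
    );

    $xpath = new DOMXPath($dom);
    $matched = [];
    foreach ($xpath->query($query) as $node) {
        // saveHTML() re-serializes the node, so the markup may differ from
        // the input (attribute quoting, entities, self-closing syntax).
        $matched[] = $dom->saveHTML($node);
    }
    return $matched;
}
```

Because the result is serialized from the parsed tree, it is not guaranteed to be byte-for-byte identical to the input, which is the limitation already described in the question.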
Oh, you are SO LUCKY I read this. I have extensive experience with this EXACT problem, and I know JUST how deep this rabbit hole goes. (VERY)
Fortunately, there is a simple solution… that I found after so. much. effort. Here it is:
In order to properly support all of the modern and malformed HTML content that you might encounter, you’ll need the HTML Tidy binary:
https://www.html-tidy.org/
It’s a tiny, standalone binary that will correctly interpret and standardize all of the HTML that it encounters.
Next, you’ll need a modern implementation of the DOMElement (and its related classes). I recommend MastermindsHTML5: https://github.com/Masterminds/html5-php
This is available as a Composer package (composer require 'masterminds/html5'). After that, you’re off to the races!
You can now use those DOM classes without any issues:
https://www.php.net/manual/en/book.dom.php
You’ll find that it can handle your exact use case very well!
Here’s the class that I wrote for interfacing with the HTML Tidy binary:
And here’s the matching configuration file:
At this point, you should be good to go!
However, let me just say that if you succumb to temptation and attempt to write regular expressions to parse the HTML, then you are descending into a fiery pit of Hell. As you fall deeper and deeper into the W3C standards, you will eventually be consumed by them.
Save yourself! Follow the one true path to salvation: outsource this burden to third parties who have literally spent decades trying to get it all perfect for you, and will likely spend decades more as HTML continues to evolve.