Php - Extract all <a> and <img> tags (exactly as represented in the input) from HTML 5 that contain a specific class

Pida
January 27, 2024

I need to improve some open source code. It contains a function to extract all <a> and <img> tags with a specific class from a string that represents HTML. This function uses regular expressions:

preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);

// Build the union set from $matches_img and $matches_a

This works mostly, but not always. Specifically, the regular expressions can match multiple tags in a single match:

$text = '<a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a>';

// matches the whole string in a single match
preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);

My first approach

I tried to make the regular expression more specific:

// old
<a(.*)class="(.*)foo(.*)">(.*)</a>

// new
<a([^<>]*)class="(.*)foo(.*)">([^<>]*)</a>

But this, too, can match substrings that contain multiple tags:

$text = '<img class="bar"><img class="foo" src="baz">';

// still matches the whole string in a single match
preg_match_all('#<img([^<>]*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);

Is there a robust way to improve on this? I could replace the second and third groups ((.*)) with [^<>]* as well, but then I'd run into trouble as soon as an image's alt or title attribute contains an angle bracket.
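One way to tolerate angle brackets inside attribute values is to have the pattern consume each quoted value as a unit, then check the class attribute in a second step. The following is only a minimal sketch of that idea, not an HTML parser: the function name extractTagsWithClass is made up for illustration, it assumes attribute values are well-formed (every quote is closed), it takes an <a> element to end at the next </a>, and it does not know about comments or script content.

```php
<?php
// Sketch only: a quote-aware tag scanner, not an HTML parser.
// extractTagsWithClass() is a hypothetical helper name.
function extractTagsWithClass(string $html, string $class): array
{
    // Opening <a ...> or <img ...>; quoted values may contain < and >.
    $open = '~<(a|img)(?=[\s/>])((?:[^>"\']|"[^"]*"|\'[^\']*\')*+)>~i';
    // One attribute at a time: name, then optional ="..." / ='...' / =bare.
    $attr = '~([^\s"\'=<>/]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+)))?~';

    preg_match_all($open, $html, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $found = [];
    foreach ($tags as $tag) {
        [$raw, $start] = $tag[0];          // matched opening tag and its byte offset
        $name = strtolower($tag[1][0]);    // "a" or "img"

        // Does the class attribute contain $class as a whole token?
        $hasClass = false;
        preg_match_all($attr, $tag[2][0], $attrs, PREG_SET_ORDER);
        foreach ($attrs as $a) {
            if (strtolower($a[1]) !== 'class') {
                continue;
            }
            // Only one of the three value groups can be non-empty.
            $value = ($a[2] ?? '') . ($a[3] ?? '') . ($a[4] ?? '');
            $tokens = preg_split('~\s+~', $value, -1, PREG_SPLIT_NO_EMPTY);
            $hasClass = in_array($class, $tokens, true);
            break; // only the first class attribute counts
        }
        if (!$hasClass) {
            continue;
        }

        if ($name === 'img') {
            $found[] = $raw;
        } elseif (preg_match('~</a\s*>~i', $html, $close, PREG_OFFSET_CAPTURE, $start + strlen($raw))) {
            // Copy the input verbatim from "<a" up to and including "</a>".
            $end = $close[0][1] + strlen($close[0][0]);
            $found[] = substr($html, $start, $end - $start);
        }
    }
    return $found;
}

print_r(extractTagsWithClass('<img class="bar"><img alt="a > b" class="foo" src="baz">', 'foo'));
```

Because every result is cut out of the input string by offset, the tags come back byte for byte as they were written. The class is compared as a whole token, so class="foobar" is not treated as a match for foo.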

My second approach: DOMDocument

I tried to parse the HTML with this function:

function getElementsByClassName($html, $className, $tagName=null) {
    $dom = new DOMDocument('1.0'); 
    $dom->loadHTML($html);

    if ($tagName){
        $elements = $dom->getElementsByTagName($tagName);
    } else {
        $elements = $dom->getElementsByTagName("*");
    }

    $matched = [];

    for ($i=0; $i<$elements->length; $i++) {
        if ($elements->item($i)->attributes->getNamedItem('class')) {
            $classes = $elements->item($i)->attributes->getNamedItem('class')->nodeValue;
            if (str_contains($classes, $className)) {
                $matched[]=$dom->saveHTML($elements->item($i));
            }
        }
    }
    return $matched;
}

The problem here is that the matches I get do not correspond exactly to the input. There seems to be an encoding problem, but more importantly, DOMDocument was written to parse HTML 4. The function does give me all the tags I need to extract, but special characters and syntax differences between HTML 4 and HTML 5 cause problems. I need to get the tags exactly as they appear in the input string.

Is there a robust solution to achieve this?

Answers

One possibility using DOMDocument/DOMXPath which I think should be pretty robust given the constraints in your question:

$html = '<img alt="äöü" class="foo" src="bar"><a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a><div class="foo">xxx</div>';

$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$doc = new DOMDocument('1.0');
$doc->loadHTML($contentType . $html, LIBXML_NOERROR);

$xp = new DOMXPath($doc);
foreach ($xp->query('//*[contains(@class, "foo") and (self::a or self::img)]') as $el) {
    echo $el->ownerDocument->saveHTML($el) . PHP_EOL;
}

Output:

<img alt="äöü" class="foo" src="bar">
<a class="foo" src="target2">link text 2</a>

Demo on 3v4l.org
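
One caveat: contains(@class, "foo") is a substring test, so it would also match class="foobar". If that matters, the usual XPath 1.0 idiom is to pad the normalized class list with spaces and search for the padded token. A small sketch of that variant, with made-up sample input:

```php
<?php
// Hypothetical input: "foobar" must NOT count as class "foo".
$html = '<img class="foobar" src="a.png"><img class="big foo" src="b.png"><a class="foo" href="#">x</a>';

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_NOERROR);

$xp = new DOMXPath($doc);
// Pad the whitespace-normalized class list with spaces, then look for " foo ".
$query = '//*[(self::a or self::img) and contains(concat(" ", normalize-space(@class), " "), " foo ")]';

$found = [];
foreach ($xp->query($query) as $el) {
    $found[] = $doc->saveHTML($el);
}
print_r($found);
```

This only changes which elements are selected. The markup is still re-serialized by saveHTML(), so it can differ from the input in details.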

- Andy
- January 27, 2024 at 4:54 am
- 0 votes
Oh, you are SO LUCKY I read this. I have extensive experience with this EXACT problem, and I know JUST how deep this rabbit hole goes. (VERY)

Fortunately, there is a simple solution… that I found after so. much. effort. Here it is:

In order to properly support all of the modern and malformed HTML content that you might encounter, you’ll need the HTML Tidy binary:
https://www.html-tidy.org/

It’s a tiny, standalone binary that will correctly interpret and standardize all of the HTML that it encounters.

Next, you’ll need an HTML5-aware parser that produces the standard DOM classes (DOMDocument, DOMElement and so on). I recommend Masterminds HTML5:
https://github.com/Masterminds/html5-php

This is available as a Composer package (composer require 'masterminds/html5').

After that, you’re off to the races!
1. Use the HTML Tidy command to sanitize your HTML input
2. Use the DOM classes to traverse and manipulate the DOM
You can now use those DOM classes without any issues:
https://www.php.net/manual/en/book.dom.php

You’ll find that it can handle your exact use case very well!

Here’s the class that I wrote for interfacing with the HTML Tidy binary:
```
<?php

namespace YourProject;

use Exception;

class TidyCommand
{
    private $command;

    private static $STDIN = 0;
    private static $STDOUT = 1;

    public function __construct (string $executablePath, string $settingsPath)
    {
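        // Escape both paths so spaces or shell metacharacters cannot break the command.
        $executableArgument = escapeshellarg($executablePath);
        $settingsPathArgument = escapeshellarg($settingsPath);

        $this->command = "{$executableArgument} -config {$settingsPathArgument}";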
    }

    public function run (string $stdin): string
    {
        $descriptors = [
            self::$STDIN => ['pipe', 'r'],
            self::$STDOUT => ['pipe', 'w']
        ];

        echo $this->command, "\n";
        $process = proc_open($this->command, $descriptors, $pipes);

        if (!is_resource($process)) {
            throw new Exception("Invalid resource: {$this->command}");
        }

        fwrite($pipes[self::$STDIN], $stdin);
        fclose($pipes[self::$STDIN]);

        $stdout = stream_get_contents($pipes[self::$STDOUT]);
        fclose($pipes[self::$STDOUT]);

        proc_close($process);

        return $stdout;
    }
}
```
And here’s the matching configuration file:
```
bare: no
char-encoding: utf8
clean: no
coerce-endtags: yes
drop-proprietary-attributes: yes
hide-comments: yes
newline: LF
quiet: yes
show-errors: 0
show-info: no
show-warnings: no
strict-tags-attributes: yes
tidy-mark: no
wrap: 0
```
At this point, you should be good to go!

However, let me just say that if you succumb to temptation and attempt to write regular expressions to parse the HTML, then you are descending into a fiery pit of Hell. As you fall deeper and deeper into the W3C standards, you will eventually be consumed by them.

Save yourself! Follow the one true path to salvation: outsource this burden to third parties, who have literally spent decades trying to get it all perfect for you, and will likely spend decades more as HTML continues to evolve.