
I need to improve some open source code. It contains a function to extract all <a> and <img> tags with a specific class from a string that represents HTML. This function uses regular expressions:

preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);

// Build the union set from $matches_img and $matches_a

This mostly works, but not always. Specifically, a regular expression can swallow several tags in a single match:

$text = '<a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a>';

// matches the whole string in a single match
preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);
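The over-matching can be reproduced in isolation. The snippet below is a minimal sketch that runs the `<a>` pattern from above against the sample string. Because the `U` modifier makes every `(.*)` lazy and `s` lets `.` cross line breaks, the first group simply keeps growing until it reaches the first `class="` anywhere later in the input, even when that attribute belongs to a different tag.

```php
<?php
$text = '<a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a>';

preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);

// Only one match is found ...
var_dump(count($matches_a[0]));      // int(1)
// ... and it covers both <a> elements.
var_dump($matches_a[0][0] === $text); // bool(true)
```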

My first approach

I tried to make the regular expression more specific:

// old
<a(.*)class="(.*)foo(.*)">(.*)</a>

// new
<a([^<>]*)class="(.*)foo(.*)">([^<>]*)</a>

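But this, too, can match substrings that contain multiple tags, because the remaining (.*) groups still cross tag boundaries. The <img> pattern has the same problem: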

$text = '<img class="bar"><img class="foo" src="baz">';

// matches whole string in a single match
preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);

Is there a robust way to improve on this? I could replace the second and third groups ((.*)) with [^<>]* as well, but then I'd run into trouble as soon as an image's alt or title attribute contains an angle bracket.
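For illustration, the angle-bracket concern can be addressed in a regex by consuming quoted attribute values as whole units, so that a `>` inside `alt="..."` does not end the tag. The following is only a sketch under stated assumptions, not a full HTML parser: the function name `extractTagsWithClass` is hypothetical, attribute values are assumed to be quoted, `<a>` elements are assumed not to be nested, and comments or `<script>` content are not treated specially.

```php
<?php
// Hypothetical helper: returns <a>...</a> and <img> substrings that carry
// $className, copied verbatim from $html.
function extractTagsWithClass(string $html, string $className): array
{
    // Attribute area of one tag: plain characters or complete quoted values,
    // so a ">" inside alt="..." does not end the tag early.
    $attrs = '(?:[^>"\']|"[^"]*"|\'[^\']*\')*';
    $pattern = '#<img\b' . $attrs . '>|<a\b' . $attrs . '>.*?</a>#si';
    preg_match_all($pattern, $html, $candidates);

    $result = [];
    foreach ($candidates[0] as $element) {
        // Inspect only the opening tag, not the link text.
        preg_match('#^<\w+' . $attrs . '>#s', $element, $open);
        // Limitation: text that looks like class="..." inside another
        // attribute's value would also be picked up here.
        if (!preg_match('#\sclass\s*=\s*(?:"([^"]*)"|\'([^\']*)\')#i', $open[0], $class)) {
            continue;
        }
        // Compare whole class tokens, so "foobar" does not count as "foo".
        $classes = preg_split('/\s+/', $class[2] ?? $class[1], -1, PREG_SPLIT_NO_EMPTY);
        if (in_array($className, $classes, true)) {
            $result[] = $element;
        }
    }
    return $result;
}

$html = '<img class="bar"><img alt="a > b" class="foo big" src="baz">'
      . '<a href="t1">one</a><a class="foo" src="t2">two</a>';
print_r(extractTagsWithClass($html, 'foo'));
```

With this input, the function returns the second `<img>` and the second `<a>` exactly as they appear in the string; the first `<img>` and the first `<a>` are skipped.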

My second approach: DOMDocument

I tried to parse the HTML with this function:

function getElementsByClassName($html, $className, $tagName=null) {
    $dom = new DOMDocument('1.0'); 
    $dom->loadHTML($html);

    if ($tagName){
        $elements = $dom->getElementsByTagName($tagName);
    } else {
        $elements = $dom->getElementsByTagName("*");
    }

    $matched = [];

    for ($i=0; $i<$elements->length; $i++) {
        if ($elements->item($i)->attributes->getNamedItem('class')) {
            $classes = $elements->item($i)->attributes->getNamedItem('class')->nodeValue;
            if (str_contains($classes, $className)) {
                $matched[]=$dom->saveHTML($elements->item($i));
            }
        }
    }
    return $matched;
}

The problem here is that the matches I get do not correspond exactly to the input. There seems to be an encoding problem, but more importantly, DOMDocument was written to parse HTML 4. The function does give me all the tags I need to extract, but special characters and the syntax differences between HTML 4 and HTML 5 cause the output to differ from the source. I need to get the tags exactly as they are contained in the input string.

Is there a robust solution to achieve this?

2 Answers


  1. One possibility using DOMDocument/DOMXPath which I think should be pretty robust given the constraints in your question:

    $html = '<img alt="äöü" class="foo" src="bar"><a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a><div class="foo">xxx</div>';
    
    $contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $doc = new DOMDocument('1.0');
    $doc->loadHTML($contentType . $html, LIBXML_NOERROR);
    
    $xp = new DOMXPath($doc);
    foreach ($xp->query('//*[contains(@class, "foo") and (self::a or self::img)]') as $el) {
        echo $el->ownerDocument->saveHTML($el) . PHP_EOL;
    }
    

    Output:

    <img alt="äöü" class="foo" src="bar">
    <a class="foo" src="target2">link text 2</a>
    

    Demo on 3v4l.org
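
    One caveat with `contains(@class, "foo")` is that it is a substring test, so it would also select `class="foobar"`. A common XPath 1.0 idiom for matching a whole class token is to pad the normalized attribute with spaces. The sketch below applies that idiom to a made-up input string; everything else follows the code above.

    ```php
    <?php
    // Sample input (hypothetical): only the second <img> and the <a> carry the exact class "foo".
    $html = '<img class="foobar" src="x"><img class="big foo" src="y"><a class="foo" href="z">link</a>';

    $contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $doc = new DOMDocument('1.0');
    $doc->loadHTML($contentType . $html, LIBXML_NOERROR);

    $xp = new DOMXPath($doc);
    // Surround the whitespace-normalized class list with spaces, then look for " foo ".
    $query = '//*[(self::a or self::img)'
           . ' and contains(concat(" ", normalize-space(@class), " "), " foo ")]';

    $found = [];
    foreach ($xp->query($query) as $el) {
        $found[] = $doc->saveHTML($el);
    }
    print_r($found);
    ```

    Here the `class="foobar"` image is no longer selected, while the other two elements are.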

  2. Oh, you are SO LUCKY I read this. I have extensive experience with this EXACT problem, and I know JUST how deep this rabbit hole goes. (VERY)

    Fortunately, there is a simple solution… that I found after so. much. effort. Here it is:

    In order to properly support all of the modern and malformed HTML content that you might encounter, you’ll need the HTML Tidy binary:
    https://www.html-tidy.org/

    It’s a tiny, standalone binary that will correctly interpret and standardize all of the HTML that it encounters.

    Next, you’ll need a modern implementation of the DOMElement (and its related classes). I recommend Masterminds HTML5:
    https://github.com/Masterminds/html5-php

    This is available as a Composer package (composer require 'masterminds/html5').

    After that, you’re off to the races!

    1. Use the HTML Tidy command to sanitize your HTML input
    2. Use the DOMElement to traverse and manipulate the DOM

    You can now use those DOM classes without any issues:
    https://www.php.net/manual/en/book.dom.php

    You’ll find that it can handle your exact use case very well!

    Here’s the class that I wrote for interfacing with the HTML Tidy binary:

    <?php
    
    namespace YourProject;
    
    use Exception;
    
    class TidyCommand
    {
        private $command;
    
        private static $STDIN = 0;
        private static $STDOUT = 1;
    
        public function __construct (string $executablePath, string $settingsPath)
        {
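            // Escape both paths so spaces or shell metacharacters cannot break the command.
            $executablePathArgument = escapeshellarg($executablePath);
            $settingsPathArgument = escapeshellarg($settingsPath);
    
            $this->command = "{$executablePathArgument} -config {$settingsPathArgument}";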
        }
    
        public function run (string $stdin): string
        {
            $descriptors = [
                self::$STDIN => ['pipe', 'r'],
                self::$STDOUT => ['pipe', 'w']
            ];
    
            echo $this->command, "\n";
            $process = proc_open($this->command, $descriptors, $pipes);
    
            if (!is_resource($process)) {
                throw new Exception("Invalid resource: {$this->command}");
            }
    
            fwrite($pipes[self::$STDIN], $stdin);
            fclose($pipes[self::$STDIN]);
    
            $stdout = stream_get_contents($pipes[self::$STDOUT]);
            fclose($pipes[self::$STDOUT]);
    
            proc_close($process);
    
            return $stdout;
        }
    }
    

    And here’s the matching configuration file:

    bare: no
    char-encoding: utf8
    clean: no
    coerce-endtags: yes
    drop-proprietary-attributes: yes
    hide-comments: yes
    newline: LF
    quiet: yes
    show-errors: 0
    show-info: no
    show-warnings: no
    strict-tags-attributes: yes
    tidy-mark: no
    wrap: 0
    

    At this point, you should be good to go!

    However, let me just say that if you succumb to temptation and attempt to write regular expressions to parse the HTML, then you are descending into a fiery pit of Hell. As you fall deeper and deeper into the W3C standards, you will eventually be consumed by them.

    Save yourself! Follow the one true path to salvation: outsource this burden to third-parties, who have literally spent decades trying to get it all perfect for you, and will likely spend decades more as HTML continues to evolve.
