
Regex PCRE2 (PHP >= 7.3)

I have a multi line string containing several <img> tags.

Using regex, I want to capture:

a.) the whole img tags that contain a src attribute and
b.) the content of that src attribute.

  • The src attribute may be enclosed in " or ', or not surrounded by quotation marks at all.
  • If quoted, the attribute should not contain " or '
  • If unquoted, the attribute should not contain \s or >

It took me a whole day to get this working somehow, but I need help to improve it.
The problem is that the quoted src attributes get captured in $matches[2] and the unquoted go to $matches[3]. Since I need all captured paths in one matching group, I’m copying the $matches[2] to $matches[3]. I’d rather have the captured data go directly to the same capturing group.

$code = <<<EOD
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;

$regex='/(<IMG(?=\s).*\sSRC\s*=\s*
    (?(?=["\'])
        .(.+?) ["\']
     |
        (.+?) [\s>]
    )
    (?(?<!>).*?>)
)/ix';

preg_match_all($regex, $code, $matches);
echo PHP_EOL . "Matches:";
// print all groups:
print_r($matches);

// copy matches captures in $matches[2] to $matches[3]
foreach($matches[2] as $a=>$b) 
    if ($b != "")
        $matches[3][$a] = $b;

// print the whole captured img tags:
print_r($matches[1]);
// print just the captured paths:
print_r($matches[3]);

Output:

Matches:Array
(
    [0] => Array
        (
            [0] => <img width src = one height>
            [1] => <img src=two height>
            [2] => <img width src="three">
            [3] => <img src = 'four' />
            [4] => <img src =five>
        )

    [1] => Array
        (
            [0] => <img width src = one height>
            [1] => <img src=two height>
            [2] => <img width src="three">
            [3] => <img src = 'four' />
            [4] => <img src =five>
        )

    [2] => Array
        (
            [0] =>
            [1] =>
            [2] => three
            [3] => four
            [4] =>
        )

    [3] => Array
        (
            [0] => one
            [1] => two
            [2] =>
            [3] =>
            [4] => five
        )

)
Array
(
    [0] => <img width src = one height>
    [1] => <img src=two height>
    [2] => <img width src="three">
    [3] => <img src = 'four' />
    [4] => <img src =five>
)
Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => four
    [4] => five
)

(And yes, I know, one shouldn’t use regex to scrape html at all, as it is not advised to do so.)

2 Answers


  1. Ideally a parser would be used for this. Your regex could be updated to something like:

    <IMG(?=\s).*\sSRC\s*=\s*(['"])?(.+?)(?:\1|>|\s)
    

    which should be closer to what you are trying to achieve. It uses a single capture group for the attribute content (group 2, since group 1 holds the optional opening quote) instead of two.

    https://regex101.com/r/qYB9B7/1
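
    As a minimal sketch of how that pattern might be applied in PHP (the `~` delimiter and the variable names are choices made here, not part of the answer), the following runs it against the sample input from the question and reads the paths from group 2:

    ```php
    <?php
    // Sample input from the question.
    $code = <<<'EOD'
    1. <img width src = one height>
    2. <notAtStart><img src=two height>
    3. NotAtAll
    4. <img width src="three"><notAtEnd>
    5. <notAtStart><img src = 'four' /><notAtEnd>
    6. <img src =five>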

  2. As you already mention that you know that using a parser would be a better choice, here is another regex option.

    If you want to either match up the quotes, or match non-whitespace chars without quotes or angle brackets, you could also make use of a named capture group and the J flag to allow duplicate subpattern names.

    <IMG(?=\s)[^<>]*\sSRC\s*=\s*(?:(['"])(?<att>[^'"]+)\1|(?<att>[^\s'"<>]+))[^<>]*>
    

    Explanation

    • <IMG(?=\s) Match <IMG and assert a whitespace char to the right
    • [^<>]* Match optional chars other than < or >
    • \sSRC\s*=\s* Match a whitespace char, SRC and an equals sign between optional whitespace chars
    • (?: Non capture group for the 2 alternatives
      • (['"])(?<att>[^'"]+)\1 Capture either ' or " in group 1, then match 1+ chars other than quotes in named group att, followed by the same closing quote (using the backreference \1)
      • | Or
      • (?<att>[^\s'"<>]+) Match 1+ non whitespace chars other than ' " < > in group att
    • ) Close the non capture group
    • [^<>]*> Match optional chars other than < or > and then match >
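
    The breakdown above can also be written into the pattern itself. The sketch below is the same pattern with the x (extended) flag added so that whitespace is ignored and each part can carry a comment; the nowdoc, the `~` delimiter and the variable names are assumptions made for this illustration.

    ```php
    <?php
    // Same pattern, spread out with the x flag. A nowdoc avoids escaping quotes.
    $re = <<<'RE'
    ~
    <IMG (?=\s)                # literal <IMG, then assert a whitespace char
    [^<>]*                     # optional chars other than < or >
    \s SRC \s* = \s*           # whitespace, SRC, equals sign with optional whitespace
    (?:
        (['"])                 # group 1: the opening quote
        (?<att> [^'"]+ )       # att: 1+ chars other than quotes
        \1                     # the same quote again
      |
        (?<att> [^\s'"<>]+ )   # att: 1+ unquoted, non whitespace chars
    )
    [^<>]* >                   # rest of the tag up to the closing >
    ~ixJ
    RE;

    $str = "<img width src = one height>\n<img width src=\"three\"><notAtEnd>\n<img WithoutSrc>";

    preg_match_all($re, $str, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        // $m[0] is the whole tag, $m['att'] the src value.
        echo $m[0] . ' => ' . $m['att'] . PHP_EOL;
    }
    ```

    The J flag is still needed here because the name att is used twice.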

    regex demo | PHP demo

    $re = '/<IMG(?=\s)[^<>]*\sSRC\s*=\s*(?:([\'"])(?<att>[^\'"]+)\1|(?<att>[^\s\'"<>]+))[^<>]*>/iJ';
    $str = '<img width src = one height>
    2. <notAtStart><img src=two height>
    3. NotAtAll
    4. <img width src="three"><notAtEnd>
    5. <notAtStart><img src = \'four\' /><notAtEnd>
    6. <img src =five><test>
    7. <img WithoutSrc>';
    
    preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
    print_r($matches);
    

    Output

    Array
    (
        [0] => Array
            (
                [0] => <img width src = one height>
                [1] => 
                [att] => one
                [2] => 
                [3] => one
            )
    
        [1] => Array
            (
                [0] => <img src=two height>
                [1] => 
                [att] => two
                [2] => 
                [3] => two
            )
    
        [2] => Array
            (
                [0] => <img width src="three">
                [1] => "
                [att] => three
                [2] => three
            )
    
        [3] => Array
            (
                [0] => <img src = 'four' />
                [1] => '
                [att] => four
                [2] => four
            )
    
        [4] => Array
            (
                [0] => <img src =five>
                [1] => 
                [att] => five
                [2] => 
                [3] => five
            )
    
    )
    

    Then you could loop over $matches and get the value for the att key.

    foreach ($matches as $m) {
        echo $m["att"] . PHP_EOL;
    }
    

    Output

    one
    two
    three
    four
    five
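
    If only the list of paths is needed, the loop could probably be replaced by array_column. This is a small sketch, not part of the answer; the $matches array below is hand-written to mirror the shape of the PREG_SET_ORDER output shown earlier, trimmed to two entries.

    ```php
    <?php
    // Hand-written stand-in for the PREG_SET_ORDER result (two entries only).
    $matches = [
        [0 => '<img width src = one height>', 1 => '', 'att' => 'one', 2 => '', 3 => 'one'],
        [0 => '<img width src="three">', 1 => '"', 'att' => 'three', 2 => 'three'],
    ];

    // Pull one key out of every match set.
    $paths = array_column($matches, 'att'); // the src values
    $tags  = array_column($matches, 0);     // the whole img tags

    print_r($paths);
    print_r($tags);
    ```

    This gives the whole tags and the paths as two flat arrays, which is what the question asked for in $matches[1] and $matches[3].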
    