Regex PCRE2 (PHP >= 7.3)
I have a multi-line string containing several <img> tags.
Using regex, I want to capture:
a.) the whole img tags that contain a src attribute and
b.) the content of that src attribute.
- The src attribute may be enclosed in "" or '', or not surrounded by quotation marks at all.
- If quoted, the attribute should not contain " or '
- If unquoted, the attribute should not contain \s or >
It took me a whole day to get this working somehow, but I need help to improve it.
The problem is that the quoted src attributes get captured in $matches[2] and the unquoted go to $matches[3]. Since I need all captured paths in one matching group, I’m copying the $matches[2] to $matches[3]. I’d rather have the captured data go directly to the same capturing group.
$code = <<<EOD
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;
$regex='/(<IMG(?=\s).*\sSRC\s*=\s*
(?(?=["\'])
.(.+?) ["\']
|
(.+?) [\s>]
)
(?(?<!>).*?>)
)/ix';
preg_match_all($regex, $code, $matches);
echo PHP_EOL . "Matches:";
// print all groups:
print_r($matches);
// copy matches captures in $matches[2] to $matches[3]
foreach($matches[2] as $a=>$b)
    if ($b != "")
        $matches[3][$a] = $b;
// print the whole captured img tags:
print_r($matches[1]);
// print just the captured paths:
print_r($matches[3]);
Output:
Matches:Array
(
[0] => Array
(
[0] => <img width src = one height>
[1] => <img src=two height>
[2] => <img width src="three">
[3] => <img src = 'four' />
[4] => <img src =five>
)
[1] => Array
(
[0] => <img width src = one height>
[1] => <img src=two height>
[2] => <img width src="three">
[3] => <img src = 'four' />
[4] => <img src =five>
)
[2] => Array
(
[0] =>
[1] =>
[2] => three
[3] => four
[4] =>
)
[3] => Array
(
[0] => one
[1] => two
[2] =>
[3] =>
[4] => five
)
)
Array
(
[0] => <img width src = one height>
[1] => <img src=two height>
[2] => <img width src="three">
[3] => <img src = 'four' />
[4] => <img src =five>
)
Array
(
[0] => one
[1] => two
[2] => three
[3] => four
[4] => five
)
(And yes, I know, one shouldn’t use regex to scrape html at all, as it is not advised to do so.)
2 Answers
Ideally a parser would be used for this. Your regex could be updated to something like:
which should be closer to what you are trying to achieve. This uses 1 capture group rather than the two for the attribute content.
https://regex101.com/r/qYB9B7/1
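One way to get a single capture group for the attribute content is a branch reset group, (?|...), in which the capture group of every alternative shares the same number. The following is only a minimal sketch under that assumption; it is not necessarily the pattern behind the link above, and the character classes are chosen to follow the rules stated in the question.

```php
<?php
// Sample input taken from the question.
$code = <<<'EOD'
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;

// Group 1: the whole img tag.
// Group 2: the src value. Inside (?| ... ) each alternative's capture
// group is numbered 2, whether the value is double-quoted,
// single-quoted or unquoted.
$regex = '~(<img(?=\s)[^<>]*\ssrc\s*=\s*'
       . '(?|"([^"\']+)"|\'([^"\']+)\'|([^\s"\'<>]+))'
       . '[^<>]*>)~i';

preg_match_all($regex, $code, $matches);

print_r($matches[1]); // the five img tags that have a src attribute
print_r($matches[2]); // one, two, three, four, five
```

With this layout the copy loop from the question is no longer needed, because quoted and unquoted values both end up in $matches[2].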
As you already mention that you know that using a parser would be a better choice, here is another regex option.
If you want to either match up the quotes, or match non-whitespace chars without quotes or angle brackets, you could also make use of a named capture group and the J flag to allow duplicate subpattern names.
Explanation
- <IMG(?=\s)
  Match <IMG and assert a whitespace char to the right
- [^<>]*
  Match optional chars other than < or >
- \sSRC\s*=\s*
  Match a whitespace char, SRC and an equals sign between optional whitespace chars
- (?:
  Non capture group for the 2 alternatives
  - (['"])(?<att>[^'"]+)\1
    Capture group 1, capture either ' or " and then match 1+ chars other than the quotes in between the same closing quote (using the backreference \1) in named group att
  - |
    Or
  - (?<att>[^\s'"<>]+)
    Match 1+ non whitespace chars other than ' " < > in group att
- )
  Close the non capture group
- [^<>]*>
  Match optional chars other than < or > and then match >
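Put together, the parts explained above can be tried out as follows. This is a sketch assembled from the explanation, not a copy of the linked demo; the ~ delimiters, the PREG_SET_ORDER flag and the $srcs variable are assumptions made for the example.

```php
<?php
// Sample input taken from the question.
$code = <<<'EOD'
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;

// i = case insensitive, J = allow duplicate subpattern names,
// so both alternatives can capture into the same named group "att".
$regex = '~<img(?=\s)[^<>]*\ssrc\s*=\s*'
       . '(?:([\'"])(?<att>[^\'"]+)\1|(?<att>[^\s\'"<>]+))'
       . '[^<>]*>~iJ';

// PREG_SET_ORDER yields one array per match, which makes looping simple.
preg_match_all($regex, $code, $matches, PREG_SET_ORDER);

$srcs = [];
foreach ($matches as $m) {
    // $m[0] is the whole img tag, $m['att'] the src value.
    $srcs[] = $m['att'];
    echo $m[0], ' => ', $m['att'], PHP_EOL;
}
// $srcs now holds: one, two, three, four, five
```

Reading the value by name avoids having to know which numbered group a given alternative filled.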
regex demo | PHP demo
Output
Then you could loop the $matches and get the value for the att key.
Output