I’m trying to create a Perl-compatible regular expression that matches a URL only when it is not preceded by an equal sign and an optional single or double quote, ignoring whitespace. The code below gives a warning: Warning: preg_replace(): Compilation failed: lookbehind assertion is not fixed length at offset 0
I know my URL regular expression isn’t perfect, but I’m more focused on how to do the negative lookbehind or how to express this in some other way.
For example, with the code below, $matches should contain http://www.url1.com/ and http://www.url3.com/, but not the other URLs. How can I do this? As written, the code emits the warning and does not populate the $matches variable.
PHP Code:
$html = "
http://www.url1.com/
= ' http://www.url2.com/
'http://www.url3.com/
<a href='http://www.url4.com/'>Testing1</a>
<img src='https://url5.com'>Testing2</a>";
$url_pregex = '((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)';
$pregex = '(?<!\s*=\s*[\'"]?\s*)'.$url_pregex;
preg_match_all('`'.$pregex.'`i', $html, $matches);
echo "Matches<br><pre>";
var_export($matches);
echo "</pre>";
Perl Regex in PHP, using ` instead of /:
'`(?<!\s*=\s*[\'"]?\s*)((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)`i'
2 Answers
One way to work around this is to use an alternation, the first part of which matches URLs which are preceded by = (and an optional quote), and the second which just matches URLs, which are then captured. This works because the first part of an alternation is always tested first, so only URLs which are not preceded by = will be captured by the second part of the alternation.
I’ve removed capture groups from your $url_pregex for simplicity; if you want them in, you’ll need to adjust the group number on $matches in this code to get the complete matches.
Output:
Demo on 3v4l.org
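The alternation idea described in this answer can be sketched roughly as follows. This is a minimal illustration and not necessarily the answerer's exact code: the variable names $url and $urls are made up for the example, and the URL sub-pattern is a simplified, group-free version of the one in the question.

```php
<?php
// Sample input taken from the question.
$html = "
http://www.url1.com/
= ' http://www.url2.com/
'http://www.url3.com/
<a href='http://www.url4.com/'>Testing1</a>
<img src='https://url5.com'>Testing2</a>";

// URL sub-pattern without capture groups (simplified from the question).
$url = 'https?://[-a-zA-Z()0-9@:%_+.~#?&;/=]+';

// Branch 1 consumes URLs that follow "=" (plus optional quote and whitespace)
// without capturing them; branch 2 captures every remaining URL in group 1.
$pattern = '`=\s*[\'"]?\s*' . $url . '|(' . $url . ')`i';

preg_match_all($pattern, $html, $matches);

// Group 1 is an empty string for matches made by branch 1, so drop those
// entries and reindex the array.
$urls = array_values(array_filter($matches[1]));

print_r($urls);
```

With the sample input, $urls should end up holding only http://www.url1.com/ and http://www.url3.com/. The unwanted URLs are still matched (by the first branch), which is why the empty entries have to be filtered out afterwards.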
Note that you need to use preg_match_all to get all matches in the text.
You should not bloat your output array with unwanted/disqualifiable matches as shown in Nick’s answer (then there will be no need to mop up with array_filter() and array_values()).
To consume and discard matches, use (*SKIP)(*FAIL).
I’ve taken the liberty of tuning up your regex pattern while implementing my advice.
Code: (Demo)
Output:
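A minimal sketch of the (*SKIP)(*FAIL) technique, under the assumption that the same simplified URL sub-pattern is acceptable; it is an illustration of the idea rather than the answerer's tuned pattern, and the name $url is hypothetical.

```php
<?php
// Sample input taken from the question.
$html = "
http://www.url1.com/
= ' http://www.url2.com/
'http://www.url3.com/
<a href='http://www.url4.com/'>Testing1</a>
<img src='https://url5.com'>Testing2</a>";

// Simplified URL sub-pattern (an assumption, not the answerer's exact pattern).
$url = 'https?://[-a-zA-Z()0-9@:%_+.~#?&;/=]+';

// Branch 1 matches "=" + optional quote + URL, then (*SKIP)(*FAIL) discards
// that match and resumes scanning after the consumed text.
// Branch 2 matches any URL that branch 1 did not consume.
$pattern = '`=\s*[\'"]?\s*' . $url . '(*SKIP)(*FAIL)|' . $url . '`i';

preg_match_all($pattern, $html, $matches);

// Only qualifying URLs appear as full matches, so no filtering is needed.
print_r($matches[0]);
```

With the sample input, $matches[0] should contain only http://www.url1.com/ and http://www.url3.com/. Because the disqualified URLs never reach the result array, there are no empty entries to clean up.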
If you are actually trying to parse valid HTML, then using a regex is not likely to be the most appropriate tool.