
I’m trying to create a Perl-compatible regular expression that matches a URL that is not preceded by an equals sign and an optional single or double quote, ignoring whitespace. The code below gives this warning: Warning: preg_replace(): Compilation failed: lookbehind assertion is not fixed length at offset 0

I know my URL regular expression isn’t perfect, but I’m more focused on how to do the negative lookbehind or how to express this in some other way.

For example, with the code below, the matches should contain http://www.url1.com/ and http://www.url3.com/, but not the other URLs. How can I do this? As written, the code gives a warning and does not populate the $matches variable.

PHP Code:

$html = "
http://www.url1.com/
= ' http://www.url2.com/
'http://www.url3.com/
<a href='http://www.url4.com/'>Testing1</a>
<img src='https://url5.com'>Testing2</a>";

$url_pregex = '((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)';
$pregex = '(?<!\s*=\s*[\'"]?\s*)'.$url_pregex;

preg_match_all('`'.$pregex.'`i', $html, $matches);

echo "Matches<br><pre>";
var_export($matches);
echo "</pre>";

Perl Regex in PHP, using ` instead of /:

'`(?<!\s*=\s*[\'"]?\s*)((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)`i'

2 Answers


  1. One way to work around this is to use an alternation: the first branch matches URLs that are preceded by = (and an optional quote), and the second branch matches and captures any other URL. This works because the first branch of an alternation is always tried first, so a URL preceded by = is consumed there without being captured, and only URLs that are not preceded by = are captured by the second branch.

    I’ve removed the capture groups from your $url_pregex for simplicity; if you want to keep them, you’ll need to adjust the group number used on $matches in this code to get the complete matches.

    $html = "
    http://www.url1.com/
    = ' http://www.url2.com/
    'http://www.url3.com/
    <a href='http://www.url4.com/'>Testing1</a>
    <img src = 'https://url5.com'>Testing2</a>";
    
    $url_pregex = 'https?://[-a-zA-Z()0-9@:%_+.~#?&;//=]+';
    $pregex = "\s*=\s*['\"]?\s*$url_pregex|($url_pregex)";
    
    preg_match_all('`' . $pregex . '`i', $html, $matches);
    
    echo "Matches<br><pre>";
    var_export(array_values(array_filter($matches[1])));
    echo "</pre>";
    

    Output:

    Matches<br><pre>array (
      0 => 'http://www.url1.com/',
      1 => 'http://www.url3.com/',
    )</pre>
    

    Note that you need to use preg_match_all to get all matches in the text.
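
    If the capture groups from the question's $url_pregex are kept, a named group is one way to avoid counting group numbers. The following is a minimal sketch under that assumption; the group name url and the variable $urls are hypothetical and not part of the original code.

    ```php
    $html = "
    http://www.url1.com/
    = ' http://www.url2.com/
    'http://www.url3.com/
    <a href='http://www.url4.com/'>Testing1</a>
    <img src = 'https://url5.com'>Testing2</a>";

    // URL pattern with the question's inner capture groups left in place.
    $url_pregex = '(http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+';

    // First branch: consume URLs that follow "=" without storing them in "url".
    // Second branch: capture every other URL in the named group "url"
    // (the name is an assumption made for this sketch).
    $pregex = "\s*=\s*['\"]?\s*$url_pregex|(?<url>$url_pregex)";

    preg_match_all('`' . $pregex . '`i', $html, $matches);

    // Matches of the first branch leave empty strings in the "url" group;
    // drop them and reindex the result.
    $urls = array_values(array_filter($matches['url']));

    var_export($urls);
    ```

    Reading $matches['url'] keeps working even if more groups are later added to or removed from the URL pattern.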

  2. You should not bloat your output array with unwanted, disqualified matches as shown in Nick’s answer; then there is no need to mop up with array_filter() and array_values().

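    To consume and discard matches, use (*SKIP)(*FAIL): the first branch consumes a disqualified URL, (*FAIL) rejects it, and (*SKIP) makes the engine resume after the consumed text instead of retrying inside it.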

    I’ve taken the liberty of tuning up your regex pattern while implementing my advice.

    Code:

    $url_regex = 'https?://[-\w()@:%+.~#?&;/=]+';
    $regex = "`\s*=\s*['\"]?\s*$url_regex(*SKIP)(*FAIL)|$url_regex`i";
    
    var_export(preg_match_all($regex, $html, $matches) ? $matches[0] : []);
    

    Output:

    array (
      0 => 'http://www.url1.com/',
      1 => 'http://www.url3.com/',
    )
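
    To illustrate why (*SKIP) matters here, the sketch below runs the same pattern with and without it on the question's sample text. The variable names $fail_only, $skip_fail, $without_skip and $with_skip are assumptions made for this comparison.

    ```php
    $html = "
    http://www.url1.com/
    = ' http://www.url2.com/
    'http://www.url3.com/
    <a href='http://www.url4.com/'>Testing1</a>
    <img src='https://url5.com'>Testing2</a>";

    $url_regex = 'https?://[-\w()@:%+.~#?&;/=]+';

    // Without (*SKIP): after (*FAIL) the engine retries one character further on,
    // so the second branch still finds each URL that follows "=".
    $fail_only = "`\s*=\s*['\"]?\s*$url_regex(*FAIL)|$url_regex`i";

    // With (*SKIP): after (*FAIL) the next attempt starts where (*SKIP) was reached,
    // which is just past the disqualified URL.
    $skip_fail = "`\s*=\s*['\"]?\s*$url_regex(*SKIP)(*FAIL)|$url_regex`i";

    preg_match_all($fail_only, $html, $without_skip);
    preg_match_all($skip_fail, $html, $with_skip);

    var_export($without_skip[0]); // all five URLs
    var_export($with_skip[0]);    // only the url1 and url3 entries
    ```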
    

    If you are actually trying to parse valid HTML, then using a regex is not likely to be the most appropriate tool.
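
    As a rough sketch of that alternative, and assuming PHP's DOM extension is available, one option is to parse the markup and search only its text nodes, so attribute values are excluded by construction. The input below is a hypothetical, well-formed variant of the sample, not the text from the question.

    ```php
    // Hypothetical, well-formed input (the sample in the question is not valid HTML).
    $html = "<p>See http://www.url1.com/ or 'http://www.url3.com/
    <a href='http://www.url4.com/'>Testing1</a>
    <img src='https://url5.com'></p>";

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $urls = [];

    // Only text nodes are searched, so href and src values are never inspected.
    foreach ($xpath->query('//text()') as $text_node) {
        if (preg_match_all('`https?://[-\w()@:%+.~#?&;/=]+`i', $text_node->nodeValue, $m)) {
            $urls = array_merge($urls, $m[0]);
        }
    }

    var_export($urls);
    ```

    This only makes sense when the input really is HTML; for loose text like the question's sample, the regex approach above remains the simpler fit.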
