I’m trying to write a PCRE regular expression to search PHP code to find strings in double-quotes, handling escaped double-quotes, and to exclude situations where double-quoted and single-quoted strings overlap, e.g. when building some HTML, such as these:

$str = '<elem prop="' . $var . '">';
$str = '<div class="my-class ' . $my_var_class . ' my-other-class">';

So far I’ve been able to come up with a reliable regex that handles escaped double-quotes:

"(.*?)(?<!\\)"

This works for lines of code like these:

$str = "this is something";
$str = "this is {$another}";
$str = "could be {$hello['world']}";
$str = "and $hello[world] another";
$str = "'single quotes in double quotes'";
$str = "building <div style=\"width: 100%\" data-var=\"{$var}\"></div>";

But it doesn’t work for lines of code like my first example above; it would match "' . $var . '", but I don’t want it to match anything from that example line.
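
To make the problem concrete, here is a minimal sketch that runs the pattern above over a few of the sample lines. It uses Python's built-in `re` module purely as a convenient test harness (an assumption; the question does not say which tool runs the search), and the second line is a made-up example with escaped quotes.

```python
import re

# The pattern from the question; the lookbehind rejects a closing quote
# that directly follows a backslash.
DOUBLE_QUOTED = re.compile(r'"(.*?)(?<!\\)"')

lines = [
    r'''$str = "this is something";''',
    r'''$str = "say \"hi\" to {$name}";''',    # hypothetical line with escaped quotes
    r'''$str = '<elem prop="' . $var . '">';''',  # the unwanted case
]

# Collect the full text of every match found on each line.
matches = [[m.group(0) for m in DOUBLE_QUOTED.finditer(line)] for line in lines]

for line, found in zip(lines, matches):
    print(line, "->", found)
```

The first two lines give the expected strings, while the third reports `"' . $var . '"`, which is the text between two quotes that each belong to a different single-quoted string.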

I’ve tried using the principles discussed at https://stackoverflow.com/a/62558215 and https://stackoverflow.com/a/6464500, but a look-ahead isn’t sufficient by itself, and I’m having a hard time coming up with a look-behind that doesn’t give me a compilation error about "lookbehind assertion is not fixed length". I feel like the answer at https://stackoverflow.com/a/36186925/3404349 might (?) be getting close to what I’m looking for, but it seems to me that it’s matching the inverse (of sorts) of my goal.

2 Answers


  1. Chosen as BEST ANSWER

    Huge thanks to @Michail for the comment on the question that got me on the right track. I used that suggestion and developed it further to also handle inline and block comments (which may contain an "orphaned" single- or double-quote, thus inverting the desired matching).

    /\*.*?\*/(*SKIP)^|//.*?$(*SKIP)^|'(?>\\?.)*?'(*SKIP)^|"(?>\\?.)*?"
    


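    Note that the m and s flags are important for this to work: m lets ^ and $ match at line boundaries, and s lets . match newlines, so block comments and strings can span several lines.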

    Here's a break-down of how this works as far as I understand it.

    In this use case, there are four alternatives separated by the alternation pipe (|). Any start/end pair that we don't want to keep/match should come first in the list of alternations because of how (*SKIP) works.

    .*? and (?>\\?.)*?: These match zero or more of any character between the respective start and end markers, as few as possible. The second form also allows an optional backslash before each character, to handle escaped characters within strings. It uses an atomic group, which prevents backtracking into the group: once a backslash and the character after it have been consumed together, the engine cannot go back and re-read the backslash on its own, which would otherwise let an escaped quote be taken as the end of the string.

    (*SKIP)^: This is a clever pair placed after each end marker. (*SKIP) says that if anything after this point fails and forces the engine to backtrack past it, the current attempt is abandoned and the search resumes from the position where (*SKIP) was reached, rather than one character after where the attempt started. The ^ immediately after it is the "start of the line" anchor, which can never match here: the position right after a closing */ or quote is not the start of a line, and $ stops just before the newline rather than after it. So once a start and end pair has been found, the ^ always fails, and the whole comment or single-quoted string is discarded and skipped.

    /* and */ match the beginning and end of block comments.

    // and $ match the beginning of an inline comment to the end of the line.

    The pairs of ' and " each match the start and end of their respective type of string.

    Since the last alternative does not include (*SKIP), it's the only one whose matches are returned.
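
For readers whose regex engine has no (*SKIP), the same idea can be approximated with a capture group. The sketch below is a swapped-in technique, not the pattern from this answer: it is written for Python's built-in `re` module, lets all four alternatives match, and keeps only the matches where the double-quoted branch (the single capturing group) took part. The function name `double_quoted_strings` and the sample input are made up for illustration, and the string branches use `(?:\\.|[^"\\])*` in place of the atomic group.

```python
import re

PATTERN = re.compile(
    r"""
    /\*.*?\*/                 # block comment: consumed, not captured
  | //[^\n]*                  # inline comment up to the end of the line
  | '(?:\\.|[^'\\])*'         # single-quoted string, escapes allowed
  | ("(?:\\.|[^"\\])*")       # double-quoted string: the only captured branch
    """,
    re.VERBOSE | re.DOTALL,
)


def double_quoted_strings(source: str) -> list[str]:
    """Return double-quoted strings that are outside comments and single quotes."""
    return [m.group(1) for m in PATTERN.finditer(source) if m.group(1) is not None]


# Hypothetical PHP snippet, including an orphaned quote inside a comment.
SAMPLE = r'''
$a = '<elem prop="' . $var . '">';
$b = "say \"hi\"";  // don't match "this"
/* nor "this" */
$c = "it's {$x}";
'''

for found in double_quoted_strings(SAMPLE):
    print(found)
```

As with (*SKIP), the order of the alternatives matters: comments and single-quoted strings come first so that they swallow any double quotes they contain before the last branch gets a chance to see them.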


  2. You can do it using a negative lookbehind, (?<!'):

    "(?<!')(.*?)(?<!\\)(?<!')"
    
