I want a regex that can match spaces as long as they occur outside of parentheses and quotes. For example, it would match the spaces after the 0, 3, 6, and 9 in the following string:
0 " 1 2 "3 ( 4 5 )6 ( ( 7 ) ( "("8 ) )9 " ( A B C ) ) ( D E F ( "
I made a simple regex that doesn’t care about parentheses:
\s+(?=(?:[^"]*"[^"]*")*[^"]*$)
However, I don’t know how to make a regex that checks for parentheses, because they can be nested.
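To make the limitation concrete, here is a small sketch that runs the pattern above in JavaScript. It assumes the pattern is meant to start with \s+ (whitespace), and the helper name matchIndices is made up for this illustration.

```javascript
// The question's pattern: whitespace that is followed by an even number
// of double quotes up to the end of the string (i.e. outside quotes).
const outsideQuotes = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

// Hypothetical helper: start index of every match in `text`.
function matchIndices(text) {
  return Array.from(text.matchAll(outsideQuotes), (m) => m.index);
}

// The space inside the quotes (index 4) is skipped.
console.log(matchIndices('a "b c" d')); // indices 1 and 7

// Parentheses are ignored, so the space inside them (index 4) still matches.
console.log(matchIndices('a (b c) d')); // indices 1, 4 and 7
```

The second call shows the gap: the lookahead only counts quotes, so nothing stops a space inside parentheses from matching.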
Bonus points if it somehow checks for unmatched parentheses and quotes (in JavaScript or pseudocode).
2 Answers
tl;dr
Explanation
Contrary to common belief, regex is capable of matching nested parentheses in engines that support recursive subroutine calls, such as PCRE [https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC25].
First step
Let us first have a regex that matches the opposite of what we want:
Inside the capture group there is an alternation ((?&-1) is the recursive subroutine call).
Second step
We can now find anything which is either space or something of the previous:
Matches are either one-or-more unquoted/unparenthesized spaces or the opposite expression, and the two kinds are alternating.
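JavaScript's regex engine has no recursion, so the pattern from this answer cannot be used there as written. The same "match both kinds, keep only the spaces" idea can still be sketched for the simpler case where parentheses are not nested: the unwanted expressions are matched without a capture group, and only matches where the space group participated are kept. The pattern and the name topLevelSpaces are assumptions made for this illustration.

```javascript
// Branch 1: a quoted string.
// Branch 2: one NON-nested pair of parentheses (quoted strings allowed inside).
// Branch 3: the spaces we want, captured in group 1.
const unwantedOrSpaces = /"[^"]*"|\((?:"[^"]*"|[^()"])*\)|(\s+)/g;

// Hypothetical helper: start indices of space runs outside quotes and
// outside a single level of parentheses.
function topLevelSpaces(text) {
  const found = [];
  for (const m of text.matchAll(unwantedOrSpaces)) {
    // Group 1 is undefined when one of the "unwanted" branches matched.
    if (m[1] !== undefined) found.push(m.index);
  }
  return found;
}

console.log(topLevelSpaces('a "b c" d (e f) g')); // indices 1, 7, 9 and 15
```

Because the unwanted branches consume their text, the scan resumes after them, which is roughly what rejecting and retrying achieves in PCRE. Nested parentheses are out of reach for this sketch.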
Third step
If we found unquoted/unparenthesized spaces, we are done. But if we have found the expression we don’t want, we can tell the regex engine to reject it and retry matching the subject from where we have stopped.
(*SKIP) will mark this position, and (*FAIL) will reject the match.
Extra
The first branch matches spaces, the second branch matches correctly nested parentheses and quotes. What else is left? Unmatched parentheses and quotes. Let’s put them in a named capture group:
If the capture group named ‘error’ is present in the matches, we have unmatched parentheses and quotes.
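Since the question asks for JavaScript, where this PCRE pattern is not available, the same check can be sketched with a plain scan. This is only an illustration under stated assumptions: everything inside a pair of double quotes is literal (so parentheses there are ignored), an unclosed quote swallows the rest of the string, and the name findUnmatched is made up here.

```javascript
// Hypothetical helper: returns the unmatched parentheses and quotes in
// `text` as { index, char } objects, sorted by index.
function findUnmatched(text) {
  const errors = [];
  const openParens = []; // indices of "(" still waiting for a ")"
  let quoteStart = -1;   // index of the currently open quote, or -1

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (quoteStart !== -1) {
      // Inside quotes only a closing quote matters.
      if (ch === '"') quoteStart = -1;
      continue;
    }
    if (ch === '"') quoteStart = i;
    else if (ch === '(') openParens.push(i);
    else if (ch === ')') {
      if (openParens.length > 0) openParens.pop();
      else errors.push({ index: i, char: ')' }); // nothing to close
    }
  }

  for (const index of openParens) errors.push({ index, char: '(' });
  if (quoteStart !== -1) errors.push({ index: quoteStart, char: '"' });
  return errors.sort((a, b) => a.index - b.index);
}

console.log(findUnmatched('(a) "b"'));  // empty array: everything is balanced
console.log(findUnmatched('a ) ( b'));  // ")" at index 2 and "(" at index 4
```

An empty result plays the role of the 'error' group being absent.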
Use a simple parser to find the spaces (adapt for malformed strings (no matching quotes or brackets)):
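A minimal sketch of such a parser in JavaScript, not necessarily the one this answer had in mind. It assumes that text inside double quotes is literal, that quotes inside parentheses still count as quotes, and that malformed input is tolerated rather than reported (a stray ")" is ignored, an unclosed quote or "(" hides the rest of the string). The name findTopLevelSpaces is hypothetical.

```javascript
// Hypothetical helper: returns [start, end) index pairs for every run of
// whitespace that is outside quotes and outside all parentheses.
function findTopLevelSpaces(text) {
  const runs = [];
  let depth = 0;        // current parenthesis nesting depth
  let inQuote = false;  // true while inside a "..." section
  let runStart = -1;    // start of the whitespace run being collected

  // Loop one position past the end so an open run gets closed.
  for (let i = 0; i <= text.length; i++) {
    const ch = text[i]; // undefined when i === text.length
    const isTopLevelSpace =
      ch !== undefined && !inQuote && depth === 0 && /\s/.test(ch);

    if (isTopLevelSpace) {
      if (runStart === -1) runStart = i;
      continue;
    }
    if (runStart !== -1) {
      runs.push([runStart, i]);
      runStart = -1;
    }
    if (ch === undefined) break;

    if (inQuote) {
      if (ch === '"') inQuote = false;
    } else if (ch === '"') {
      inQuote = true;
    } else if (ch === '(') {
      depth++;
    } else if (ch === ')' && depth > 0) {
      depth--;
    }
  }
  return runs;
}

const sample = '0 " 1 2 "3 ( 4 5 )6 ( ( 7 ) ( "("8 ) )9 " ( A B C ) ) ( D E F ( "';
// Print the character that precedes each top-level run of spaces.
console.log(findTopLevelSpaces(sample).map(([start]) => sample[start - 1]).join(''));
// prints "0369"
```

Unlike the regex approaches, this handles arbitrary nesting with a single counter, and the malformed-input policy is easy to change in the last two branches.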