
I want a regex that can match spaces as long as they occur outside of parentheses and quotes. For example, it would match the spaces after the 0, 3, 6, and 9 in 0 " 1 2 "3 ( 4 5 )6 ( ( 7 ) ( "("8 ) )9 " ( A B C ) ) ( D E F ( ".

I made a simple regex that doesn’t care about parentheses:

\s+(?=(?:[^"]*"[^"]*")*[^"]*$)
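To make the behaviour of that pattern concrete, here is a minimal sketch in JavaScript. The lookahead accepts a run of whitespace only when an even number of quote marks follows it, which is true exactly when the run sits outside a quoted section (assuming the quotes are balanced). The sample string and the variable names are made up for illustration.

```javascript
// The quotes-only pattern, written as a JavaScript regex literal.
const outsideQuotes = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

// Hypothetical sample input: one quoted section, no parentheses.
const sample = 'a "b c" d';

// Collect the start index of every whitespace run the pattern accepts.
const hits = [...sample.matchAll(outsideQuotes)].map(m => m.index);

console.log(hits); // [ 1, 7 ] (the space at index 4 is inside the quotes)
```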

However, I don’t know how to make a regex that checks for parentheses, because they can be nested.

Bonus points if it somehow checks for unmatched parentheses and quotes (in JavaScript or pseudocode).

2 Answers


  1. tl;dr

    /\s++|([^\s"()]++|"[^"]*+"|(\((?:[^"()]++|"[^"]*+"|(?-1))*+\)))++(*SKIP)(*FAIL)/gs
    

    Explanation

    Contrary to common belief, a regex can match nested parentheses, at least in flavors such as PCRE that support recursive subroutine calls [https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC25].

    First step

    Let us first have a regex that matches the opposite of what we want:

    (
      [^\s"()]++
    |
      "[^"]*+"
    |
      \((?:\s++|(?-1))*+\)
    )++
    

    Inside the capture group there is an alternation:

    1. The first branch matches anything that is not a space, quote mark, or parenthesis. (++ means "as many as possible, and do not backtrack once found".)
    2. The second branch matches a quoted expression.
    3. The third one is trickier: it matches something enclosed in parentheses. The content is any mix of runs of spaces and anything matched by the most recently opened capture group, that is, this whole alternation ((?-1) is the recursive subroutine call).
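JavaScript's built-in RegExp has no recursive subroutine calls or possessive quantifiers, so the three branches above cannot be written as a single JavaScript regex. As a rough sketch of what the recursion does, the same grammar can be written as two small functions that call each other. The names `consumeRun` and `consumeToken` are hypothetical and not part of any library.

```javascript
// Hypothetical helper mirroring the outer "(...)++": starting at `pos`, consume
// as many tokens as possible and return the index where the run ends.
// Returns `pos` itself when nothing matches there.
function consumeRun(str, pos) {
  let i = pos;
  for (;;) {
    const next = consumeToken(str, i);
    if (next === i) return i; // no branch matched: the run is over
    i = next;
  }
}

// One pass through the alternation: returns the index just after the token,
// or `pos` unchanged when no branch matches.
function consumeToken(str, pos) {
  const c = str[pos];
  if (c === undefined) return pos;

  // Branch 1: characters that are not whitespace, quote, or parenthesis.
  if (!/[\s"()]/.test(c)) {
    let i = pos;
    while (i < str.length && !/[\s"()]/.test(str[i])) i++;
    return i;
  }

  // Branch 2: a quoted expression.
  if (c === '"') {
    const close = str.indexOf('"', pos + 1);
    return close === -1 ? pos : close + 1;
  }

  // Branch 3: "(", then any mix of whitespace and nested tokens, then ")".
  if (c === '(') {
    let i = pos + 1;
    for (;;) {
      if (/\s/.test(str[i] ?? '')) { i++; continue; }
      const next = consumeToken(str, i); // the recursive call
      if (next === i) break;
      i = next;
    }
    return str[i] === ')' ? i + 1 : pos;
  }

  return pos; // whitespace or a stray ")"
}

// The run stops at the first top-level space, here at index 11.
console.log(consumeRun('(a "b)" c)d e', 0)); // 11
```

Like the possessive quantifiers in the pattern, these functions never back up once they have consumed something. An unclosed "(" or quote simply makes the token fail.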

    Second step

    We can now match either a run of spaces or a run of the expression from the first step:

    /\s++|([^\s"()]++|"[^"]*+"|(\((?:[^"()]++|"[^"]*+"|(?-1))*+\)))++/gs
    

    Each match is either a run of one or more unquoted/unparenthesized spaces or a run of the opposite expression, and the two kinds alternate.

    Third step

    If we found unquoted/unparenthesized spaces, we are done. But if we found the expression we don’t want, we can tell the regex engine to reject it and retry from where that match stopped: (*SKIP) marks this position, and (*FAIL) rejects the match.

    /\s++|([^\s"()]++|"[^"]*+"|(\((?:[^"()]++|"[^"]*+"|(?-1))*+\)))++(*SKIP)(*FAIL)/gs
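JavaScript's RegExp does not support backtracking verbs such as (*SKIP) and (*FAIL), but the same effect can be had by hand: let the unwanted alternatives match without a capture group, and keep only the matches where the capture group took part. The sketch below is deliberately simplified to quoted sections only, because nested parentheses need recursion. The names `skipOrKeep` and `spacesOutsideQuotes` are hypothetical.

```javascript
// Alternatives 1 and 2 consume what should be skipped (plain text, quoted
// sections); alternative 3 captures the whitespace we actually want.
const skipOrKeep = /[^\s"]+|"[^"]*"|(\s+)/g;

function spacesOutsideQuotes(str) {
  const found = [];
  for (const m of str.matchAll(skipOrKeep)) {
    // Group 1 is undefined when a "skip" alternative matched. Ignoring those
    // matches is the manual counterpart of rejecting them with (*SKIP)(*FAIL).
    if (m[1] !== undefined) found.push({ index: m.index, length: m[1].length });
  }
  return found;
}

console.log(spacesOutsideQuotes('a  "b c" d'));
// [ { index: 1, length: 2 }, { index: 8, length: 1 } ]
```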
    

    Extra

    The first branch matches spaces, and the second branch matches correctly nested parentheses and quotes. What else is left? Unmatched parentheses and quotes. Let’s put them in a named capture group:

    /\s++|([^\s"()]++|"[^"]*+"|(\((?:[^"()]++|"[^"]*+"|(?-1))*+\)))++(*SKIP)(*FAIL)|(?<error>.+)/gs
    

    If the capture group named ‘error’ participates in a match, the input contains unmatched parentheses or quotes.
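Named capture groups and the s flag also exist in JavaScript, so the way the ‘error’ group is inspected can be sketched there. The example is again restricted to quotes, since JavaScript's RegExp cannot recurse into nested parentheses. The pattern `tokens`, the group name `space`, and the function `scan` are assumptions made for this illustration.

```javascript
// Plain text and quoted sections are consumed silently; whitespace lands in
// the "space" group; whatever none of those can match lands in "error".
const tokens = /[^\s"]+|"[^"]*"|(?<space>\s+)|(?<error>.+)/gs;

function scan(str) {
  const spaces = [];
  let error = null;
  for (const m of str.matchAll(tokens)) {
    if (m.groups.space !== undefined) spaces.push(m.index);
    if (m.groups.error !== undefined) {
      error = { index: m.index, text: m.groups.error };
    }
  }
  return { spaces, error };
}

console.log(scan('a "b" c "d e'));
// { spaces: [ 1, 5, 7 ], error: { index: 8, text: '"d e' } }
```

Because `.+` with the s flag runs to the end of the input, at most one error is reported, and it starts at the first quote that has no partner.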

  2. Use a simple parser to find the spaces (adapt it for malformed strings, i.e. ones with unmatched quotes or brackets):

    const str = '0 " 1 2 "3 ( 4 5 )6 ( ( 7 ) ( "("8 ) )9 " ( A B C ) ) ( D E F ( "';
    
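    // Assumes well-formed input: every quote and every "(" has a matching partner.
    const findSpaces = (str, inside = false, from = 0, out = []) => {
    
        for (let i = from; i < str.length; i++) {
            const c = str[i];
            // Inside parentheses: hand the index of the closing ")" back to the caller.
            if (inside && c === ')') return i;
            if (!inside && c === ' ') {
                out.push(i);                            // top-level space
            } else if (c === '"') {
                i = str.indexOf('"', i + 1);            // jump to the closing quote
            } else if (c === '(') {
                i = findSpaces(str, true, i + 1, out);  // jump to the matching ")"
            }
        }
        return out;
    
    };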
    
    console.log(findSpaces(str).map(i => `after ${str[i-1]}`));
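One possible adaptation for malformed strings, offered as a sketch rather than as part of the answer above: replace the recursion with an explicit stack of open parentheses and record every quote or parenthesis that has no partner. The function name `scanSpaces` and the shape of its result are hypothetical.

```javascript
// Hypothetical helper: returns the indexes of top-level spaces together with
// a list of unmatched quotes and parentheses.
function scanSpaces(str) {
  const spaces = [];
  const errors = [];
  const open = []; // indexes of "(" still waiting for their ")"

  for (let i = 0; i < str.length; i++) {
    const c = str[i];
    if (c === '"') {
      const close = str.indexOf('"', i + 1);
      if (close === -1) {
        errors.push({ index: i, message: 'unmatched quote' });
        break; // everything after an unclosed quote is ambiguous
      }
      i = close; // jump to the closing quote
    } else if (c === '(') {
      open.push(i);
    } else if (c === ')') {
      if (open.length === 0) errors.push({ index: i, message: 'unmatched )' });
      else open.pop();
    } else if (c === ' ' && open.length === 0) {
      spaces.push(i);
    }
  }
  for (const index of open) errors.push({ index, message: 'unmatched (' });
  return { spaces, errors };
}

const input = '0 " 1 2 "3 ( 4 5 )6 ( ( 7 ) ( "("8 ) )9 " ( A B C ) ) ( D E F ( "';
const result = scanSpaces(input);
console.log(result.spaces.map(i => `after ${input[i - 1]}`)); // after 0, 3, 6, 9
console.log(result.errors); // []
```

A design choice worth noting: spaces that follow an unclosed "(" are treated as being inside it and are not reported, so the caller should check `errors` before trusting `spaces`.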