
I’m currently facing an issue with a regex that’s causing problems in SonarQube, which requires grouping to make the intended operator precedence explicit. After grouping the regex, it’s not working as expected.

SonarQube Issue:
SonarQube flags that the regex should have grouped parts to make the operator precedence clear.

Current Regex: /^(\W)|([^a-zA-Z0-9_.]+)|(\W)$/g
This regex is meant to validate a string based on the following conditions:

Requirements:

  • If the string contains dot(s) at the beginning or end, it should throw an error immediately.
  • If the string contains any symbols apart from A-Z, a-z, 0-9, underscore, or dot (where dots can only appear in between), it should throw an error.
  • The string should only contain A-Z, a-z, 0-9, underscore, or dots (dots can’t appear at the start or end but are allowed in between).

Note:
The existing logic is designed to throw an error if the regex matches. Therefore, I need a regex that negates the conditions mentioned above without modifying the existing logic, as it’s part of a reusable codebase.

I attempted the following regex /^(\.)|([^a-zA-Z0-9_.]+)|(.*\.$)/g, but I’m concerned this might still cause SonarQube issues due to operator precedence.

How can I properly structure this regex to meet these conditions and avoid SonarQube warnings?

2

Answers


  1. Regex:

    /^\.?([^\p{L}_.\r\n]+)\.$|^\.([^\p{L}_.\r\n]+)\.?$/gmu
    

    Explanation:

    1. Anchors ^ and $:

      • ^ at the start ensures that the pattern matches from the beginning of the string (or of each line, because of the m flag).
      • $ at the end ensures that the pattern matches up to the end of the string (or of each line, because of the m flag).
    2. Optional starting dot (^\.?):

      • \.? matches an optional dot at the beginning of the string. This allows strings to start with a dot, but it is not required.
    3. Character class in the middle ([^\p{L}_.\r\n]+):

      • ([^\p{L}_.\r\n]+) is a capture group that matches one or more characters that are not in the specified set:
        • \p{L} matches any Unicode letter, so inside the negated class it excludes a-z and A-Z as well as accented letters like é or ä.
        • _ excludes underscores.
        • . excludes dots.
        • \r\n is optional (it depends on whether you parse multi-line text or a single-line string): it excludes carriage return and newline characters to prevent capturing line breaks.
    4. Required ending dot (\.$):

      • \.$ ensures that the string ends with a dot.
    5. Pipe | operator for alternation:

      • The regex uses |, which means OR, allowing for two different valid patterns:
        • ^\.?([^\p{L}_.\r\n]+)\.$ matches a string that optionally starts with a dot, has valid characters in the middle (excluding dots, underscores, letters), and ends with a dot.
        • ^\.([^\p{L}_.\r\n]+)\.?$ matches a string that starts with a dot, has valid characters in the middle, and optionally ends with a dot.
    6. Options /gmu:

      • g : global, to get all the matches, not just the first.
      • m : multi-line, so that ^ and $ match at each line if the input text has many lines.
      • u : unicode, to be able to use \p{L}.
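
    As a minimal sketch of what the u flag changes, the snippet below checks \p{L} with and without it. The sample strings are made up for illustration, and the g and m flags are left out because they do not affect these single-line checks.

    ```javascript
    // \p{L} is a Unicode property escape; it is only recognized with the u flag.
    const letter = /\p{L}/u;
    const noLetters = /^[^\p{L}_.\r\n]+$/u; // the middle character class on its own

    const results = {
      accentedIsLetter: letter.test("é"),       // é counts as a letter
      symbolsOnly: noLetters.test("!@#$"),      // no letters, underscores or dots
      rejectsAccented: noLetters.test("é*"),    // é is a letter, so the class fails
      rejectsUnderscore: noLetters.test("!_!"), // underscore is excluded too
      withoutUFlag: /\p{L}/.test("é"),          // no u flag: \p is not a property escape
    };

    console.log(results);
    ```

    Without the u flag the pattern still compiles in most engines, but it no longer means "any letter", which is easy to miss.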

    Why not capture 3 groups as in your example

    Capturing 3 groups (the middle part and the two dots), as shown below, is awkward because the number of groups that actually capture something is not fixed (it could be 2 or 3). Further on in your code, you would have to deal with this variable number of captured groups.

    Anyway here is how you could do it:

    /^(\.)?([^\p{L}_.\r\n]+)(\.)$|^(\.)([^\p{L}_.\r\n]+)(\.?)$/gmu
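
    To make the point about the variable groups concrete, here is a small sketch that collects the groups for two of the sample inputs. It drops the g flag so that exec() always starts at index 0, and the helper name groupsFor is made up for this illustration.

    ```javascript
    // Same pattern as above, without the g flag so exec() always starts at index 0.
    const re = /^(\.)?([^\p{L}_.\r\n]+)(\.)$|^(\.)([^\p{L}_.\r\n]+)(\.?)$/mu;

    // Hypothetical helper: returns the six capture groups, or null if there is no match.
    function groupsFor(input) {
      const m = re.exec(input);
      return m ? m.slice(1) : null;
    }

    const bothDots = groupsFor(".!@#$.");
    const leadingDotOnly = groupsFor(".!@#$");

    console.log(bothDots);       // groups 1-3 are filled, groups 4-6 are undefined
    console.log(leadingDotOnly); // groups 1-3 are undefined, groups 4-6 are filled
    ```

    Which half of the group list is populated depends on which alternative matched, so calling code has to check for undefined before using any group.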
    

    Tip

    If the goal is to get only the whole match, just remove the capture groups:

    /^\.?[^\p{L}_.\r\n]+\.$|^\.[^\p{L}_.\r\n]+\.?$/gmu
    

    Tested with

    Non matching examples

    a!@#.
    .!@_.
    abc.def
    !!@!
    .é*"+.
    

    Matching examples

    .!@#$.
    .!@#$
    .@@.
    .!#$.
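
    The examples can be replayed with a few lines of JavaScript. This is only a sketch: it uses the non-capturing variant from the tip and drops the g flag, because test() on a global regex keeps state in lastIndex between calls.

    ```javascript
    const re = /^\.?[^\p{L}_.\r\n]+\.$|^\.[^\p{L}_.\r\n]+\.?$/mu;

    const nonMatching = ["a!@#.", ".!@_.", "abc.def", "!!@!", '.é*"+.'];
    const matching = [".!@#$.", ".!@#$", ".@@.", ".!#$."];

    // Collect any example that behaves differently from the lists above.
    const unexpectedMatches = nonMatching.filter((s) => re.test(s));
    const unexpectedMisses = matching.filter((s) => !re.test(s));

    console.log({ unexpectedMatches, unexpectedMisses }); // both arrays should be empty
    ```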
    

    Tests links

    regex101

  2. Your current regex is correct: it will find a match when the input is not in line with the requirements.

    The SonarQube warning you refer to is probably RSPEC-5850: Alternatives in regular expressions should be grouped when used with anchors

    This rule tackles a common mistake that is made when combining ^ or $ with |. However, this is not a mistake that you have made. To make absolutely clear that you intended the ^ to only apply to the first alternative (and not all of them), and the $ to only apply to the last alternative (and not all of them), the suggestion here is to put ^ inside a group, and to do the same for $. Your current regex still leaves those out of the groups.
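
    A small sketch of that kind of mistake, using made-up patterns that are not from the question: without a group, ^ binds only to the first alternative and $ only to the last.

    ```javascript
    // Intended meaning: the whole string is exactly "a", "b" or "c".
    const ungrouped = /^a|b|c$/;   // parsed as (^a) | (b) | (c$)
    const grouped = /^(?:a|b|c)$/; // anchors apply to every alternative

    const input = "xbx";
    const ungroupedResult = ungrouped.test(input); // true: the bare "b" alternative matches anywhere
    const groupedResult = grouped.test(input);     // false: "xbx" is not exactly a, b or c

    console.log(ungroupedResult, groupedResult);
    ```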

    Note that you don’t really need to put the middle alternative in a group, as there you don’t use the ^ or $ assertions.

    Secondly, the suggestion is not to make capture groups, but just groups. So use (?: ) instead of ( ), and make sure you put ^ and $ inside them.

    Not related, but your regex doesn’t need the + quantifier. If one such character is found, it is enough. It doesn’t matter if you find more than one consecutive invalid character. Also, you can use \w to shorten the character class.

    Applying these changes, we get:

    /(?:^\W)|[^\w.]|(?:\W$)/g
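
    As a rough check, the sketch below plugs this regex into throw-on-match logic like the one the question describes. The function names assertValidName and isRejected and the sample inputs are assumptions made for illustration, and the g flag is omitted because test() on a global regex carries lastIndex over between calls.

    ```javascript
    const invalid = /(?:^\W)|[^\w.]|(?:\W$)/;

    // Hypothetical stand-in for the existing reusable logic: throw when the regex matches.
    function assertValidName(value) {
      if (invalid.test(value)) {
        throw new Error("Invalid value: " + value);
      }
      return value;
    }

    // Hypothetical helper that reports whether assertValidName would throw.
    function isRejected(value) {
      try {
        assertValidName(value);
        return false;
      } catch (err) {
        return true;
      }
    }

    const samples = ["abc.def", "a_b.c9", ".abc", "abc.", "ab$c", "a b"];
    for (const s of samples) {
      console.log(JSON.stringify(s), isRejected(s) ? "rejected" : "accepted");
    }
    ```

    Only the first two samples should be accepted: dots are fine in the middle, while a leading dot, a trailing dot, or any character outside A-Z, a-z, 0-9, underscore and dot triggers the error.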
    