I’m currently facing an issue with a regex that’s causing problems in SonarQube, which requires grouping to make the intended operator precedence explicit. After grouping the regex, it’s not working as expected.
SonarQube Issue:
SonarQube flags that the regex should have grouped parts to make the operator precedence clear.
Current Regex: /^(\W)|([^a-zA-Z0-9_.]+)|(\W)$/g
This regex is meant to validate a string based on the following conditions:
Requirements:
- If the string contains dot(s) at the beginning or end, it should throw an error immediately.
- If the string contains any symbols apart from A-Z, a-z, 0-9, underscore, or dot (where dots can only appear in between), it should throw an error.
- The string should only contain A-Z, a-z, 0-9, underscore, or dots (dots can’t appear at the start or end but are allowed in between).
Note:
The existing logic is designed to throw an error if the regex matches. Therefore, I need a regex that negates the conditions mentioned above without modifying the existing logic, as it’s part of a reusable codebase.
I attempted the following regex: /^(\.)|([^a-zA-Z0-9_.]+)|(.*\.$)/g, but I’m concerned this might still cause SonarQube issues due to operator precedence.
How can I properly structure this regex to meet these conditions and avoid SonarQube warnings?
2 Answers
Regex:
Explanation:
- Anchors ^ and $: ^ at the start ensures that the pattern matches from the beginning of the string. $ at the end ensures that the pattern matches until the end of the string.
- Optional starting dot (^\.?): \.? matches an optional dot at the beginning of the string. This allows strings to start with a dot, but it is not required.
- Character class in the middle ([^\p{L}_.\r\n]+): a capture group that matches one or more characters that are not in the specified set:
  - ^\p{L}: anything that is not a letter. \p{L} covers a-zA-Z and also accented letters like é or ä.
  - _: excludes underscores.
  - .: excludes dots.
  - \r\n: optional (depends on whether you parse a multi-line text or a single-line string); excludes newline and carriage return characters to prevent capturing line breaks.
- Required ending dot (\.$): \.$ ensures that the string ends with a dot.
- Pipe | operator for alternation: | means OR, allowing for two different valid patterns:
  - ^\.?([^\p{L}_.\r\n]+)\.$ matches a string that optionally starts with a dot, has valid characters in the middle (excluding dots, underscores, letters), and ends with a dot.
  - ^\.([^\p{L}_.\r\n]+)\.?$ matches a string that starts with a dot, has valid characters in the middle, and optionally ends with a dot.
- Options /gmu:
  - g: global, to get all the matches and not just the first.
  - m: multi-line, to match line by line if the input text has many lines.
  - u: unicode, to be able to use \p{L}.
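To illustrate the flags and the \p{L} escape described above, here is a minimal sketch. The sample strings and variable names are made up for the illustration and are not part of the answer's regex.

```javascript
// \p{L} is only available with the u flag and matches any Unicode letter.
const unicodeLetter = /\p{L}/u.test("\u00e9");   // "\u00e9" is é
const asciiLetter = /[a-zA-Z]/.test("\u00e9");   // é is outside a-zA-Z

// With m, ^ and $ also match at line breaks; with g, match() returns every match.
const sample = "1.\n2.";                          // hypothetical two-line input
const withMultiline = sample.match(/^\d\.$/gm);
const withoutMultiline = sample.match(/^\d\.$/g);

console.log(unicodeLetter, asciiLetter);          // true false
console.log(withMultiline, withoutMultiline);     // ["1.", "2."] and null
```

Without m, the pattern has to span the whole input, which is why the second match() call finds nothing.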
Why not capture 3 groups like in your example
Trying to capture 3 groups (the middle and the two dots), as shown below, is a bad idea because the number of groups that actually capture something is not fixed (it could be 2 or 3). Further on in your code, you would have to deal with this variable number of captured groups.
Anyway here is how you could do it:
Tip
If the goal is to get only the whole match, just remove the capture groups:
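As a rough sketch of this tip, assuming the pattern is built from the two alternatives listed in the explanation with their parentheses removed (the input text and variable names are hypothetical):

```javascript
// Assumed pattern: the two alternatives from the explanation, capture groups removed.
const wholeMatchOnly = /^\.?[^\p{L}_.\r\n]+\.$|^\.[^\p{L}_.\r\n]+\.?$/gmu;

// Hypothetical multi-line input: two lines that fit the pattern, one that does not.
const inputText = "123.\n.456\nabc";

// With the g flag and no capture groups, match() returns only the whole matches.
const found = inputText.match(wholeMatchOnly);
console.log(found); // ["123.", ".456"]
```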
Tested with
Non matching examples
Matching examples
Tests links
regex101
Your current regex is correct: it will find a match when the input is not in line with the requirements.
The SonarQube warning you refer to is probably RSPEC-5850: Alternatives in regular expressions should be grouped when used with anchors
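The kind of ambiguity behind this warning can be sketched with a small, made-up pattern (it is not taken from the question): anchors bind more tightly than |, so they only apply to the alternative they are written next to.

```javascript
// Hypothetical case: the intent is "the whole string is exactly a, b, or c".
const ambiguous = /^a|b|c$/;     // parsed as (^a) | (b) | (c$)
const grouped = /^(?:a|b|c)$/;   // anchors now apply to every alternative

const ambiguousResult = ambiguous.test("xbx"); // "b" anywhere is enough
const groupedResult = grouped.test("xbx");     // whole string must be a, b, or c

console.log(ambiguousResult); // true
console.log(groupedResult);   // false
console.log(grouped.test("b")); // true
```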
This rule tackles a common mistake that is made when combining ^ or $ with |. However, this is not a mistake that you have made. To make absolutely clear that you intended the ^ to only apply to the first alternative (and not all of them), and the $ to only apply to the last alternative (and not all of them), the suggestion here is to put ^ inside a group, and to do the same for $. Your current regex still leaves those out of the groups.
Note that you don’t really need to put the middle alternative in a group, as there you don’t use the ^ or $ assertions.
Secondly, the suggestion is not to make capture groups, but just groups. So use (?: ) instead of ( ), and make sure you put ^ and $ inside them.
Not related, but your regex doesn’t need the + quantifier. If one such character is found, it is enough. It doesn’t matter if you find more than one consecutive invalid character. Also, you can use \w to shorten the character class.
Applying these changes, we get: