I have a text string which contains a repeating pattern, each repetition separated from the next by the . (dot) character.
The pattern may end in a _123 (underscore followed by a sequence of digits), and I want to catch those digits in a dedicated capturing group.
The RegEx (ECMAScript) I have built mostly works:
https://regex101.com/r/iEzalU/1
/(label(:|\+))?(\w+)(?:_(\d+))?/gi
However, the (\w+) part is greedy and swallows the text that the (?:_(\d+))? part was meant to match.
Adding a ? to make \w+ non-greedy, (\w+?), works, but now I get a separate match for each character matched by \w.
How can I write this regex so that \w+ stays greedy but still does not swallow the _(\d+) part?
Otherwise, is it possible to capture everything matched by the non-greedy \w+? as a single match? (Some capturing/non-capturing group magic?)
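To make the two behaviours described above concrete, here is a minimal sketch. The sample string and the helper name `pairs` are assumptions made for illustration; the original test input is not shown in the post.

```javascript
// Hypothetical sample input; the original test string is not shown in the post.
const sample = "label:foo_12.bar_7.baz";

// The greedy pattern from the question: \w also matches "_" and digits,
// so (\w+) consumes the "_12" suffix before the optional group is tried.
const greedy = /(label(:|\+))?(\w+)(?:_(\d+))?/gi;

// The lazy variant: \w+? takes a single character and stops, because
// everything after it in the pattern is optional and may match nothing.
const lazy = /(label(:|\+))?(\w+?)(?:_(\d+))?/gi;

// Hypothetical helper: collect [name, digits] (groups 3 and 4) per match.
const pairs = (re, text) =>
  [...text.matchAll(re)].map((m) => [m[3], m[4]]);

const greedyPairs = pairs(greedy, sample);
const lazyPairs = pairs(lazy, sample);

console.log(greedyPairs); // names "foo_12", "bar_7", "baz"; digits never captured
console.log(lazyPairs.length); // 9 matches, mostly single characters
```

With the greedy pattern the digits group stays empty for every segment; with the lazy one the digits are captured, but only attached to the last character before the underscore.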
2 Answers
When creating regular expressions, it is a good idea to think about your expected match boundaries.
You know you need to match substrings in a longer string, so $ and \z can be excluded at once.
Digits, letters and underscores are all word characters matched with \w, so you want to match everything up to a character other than a word character (or, potentially, up to the end of the string).
I suggest using
(label[:+])?(\w+?)(?:_(\d+))?\b
See the regex demo
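As a minimal sketch, the pattern (label[:+])?(\w+?)(?:_(\d+))?\b can be tried in JavaScript as follows. The `gi` flags are carried over from the question, and the sample string and the helper `parseSegments` are hypothetical names introduced for illustration only.

```javascript
// Pattern assembled from the components this answer describes; the "gi"
// flags are carried over from the question and are an assumption here.
const pattern = /(label[:+])?(\w+?)(?:_(\d+))?\b/gi;

// Hypothetical helper: returns one { label, name, digits } object per match.
function parseSegments(text) {
  return [...text.matchAll(pattern)].map((m) => ({
    label: m[1], // "label:" or "label+", undefined when absent
    name: m[2], // word characters before the optional "_digits" suffix
    digits: m[3], // the trailing digits, undefined when absent
  }));
}

// Hypothetical sample input.
const segments = parseSegments("label:foo_12.bar_7.baz");
console.log(segments);
// names: "foo", "bar", "baz"; digits: "12", "7", undefined
```

The trailing \b is what forces the lazy \w+? to keep expanding: a match can only end where a word character meets a non-word character or the end of the string, so stopping after a single letter is no longer allowed.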
Details:
(label[:+])? – an optional Group 1: label and then a : or +
(\w+?) – Group 2: one or more word chars, as few as possible
(?:_(\d+))? – an optional sequence of: _ and then one or more digits captured into Group 3
\b – a word boundary: the next char can only be a non-word char, or the end of string should follow.

You can also get the desired result with a simpler regular expression that focuses on the finishing pattern of each group:
This may or may not be preceded by a label[+:] group, but that does not need to be expressed in the regular expression.
See the little demo I modified from Wiktor Stribizew’s example:
https://regex101.com/r/pO7OdW/1
Or as a snippet: