
I have a text string that contains a repeating pattern, with each repetition separated from the next by the . (dot) character. The pattern may end in _123 (an underscore followed by a sequence of digits), and I want to capture those digits in a dedicated capturing group.

The RegEx (ECMAScript) I have built mostly works:
https://regex101.com/r/iEzalU/1

/(label(:|\+))?(\w+)(?:_(\d+))?/gi

However, the (\w+) part is greedy and, because \w also matches underscores and digits, it consumes the text that the (?:_(\d+))? part should match.
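The greedy behavior can be reproduced with a short snippet. This is only a sketch: the sample string and the variable names are assumptions made for illustration, and the pattern is the one quoted above with its backslashes written out.

```javascript
// Pattern from the question, with the backslashes written out.
const greedyPattern = /(label(:|\+))?(\w+)(?:_(\d+))?/gi;

// Hypothetical sample input, not taken from the original post.
const greedySample = "label:sub_1.field_23";

// Group 3 is the \w+ part, group 4 is the digits after the underscore.
const greedyMatches = [...greedySample.matchAll(greedyPattern)]
  .map(m => ({ name: m[3], digits: m[4] }));

console.log(greedyMatches);
// name is "sub_1" and then "field_23"; digits is undefined both times
```

Since \w+ already swallows _1 and _23, the optional digits group has nothing left to match.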


Adding a ? to make \w+ non-greedy (\w+?) works, but now I get a separate match for each character matched by \w.
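A small sketch of the non-greedy variant shows the fragmentation. The sample string and variable names are again assumptions for illustration only.

```javascript
// Same pattern as in the question, but with a lazy \w+? in the middle.
const lazyPattern = /(label(:|\+))?(\w+?)(?:_(\d+))?/gi;

// Hypothetical sample input.
const lazySample = "sub_1";

// Collect the full text of every match.
const lazyMatches = [...lazySample.matchAll(lazyPattern)].map(m => m[0]);

console.log(lazyMatches); // ["s", "u", "b_1"]
```

Nothing after \w+? forces it to keep going, so each match stops after one word character unless the optional _digits part happens to follow immediately.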


How can I make this regex such that \w+ stays greedy but still does not consume the _(\d+) part?
Otherwise, is it possible to capture all the characters matched by the non-greedy \w+? as a single match (some capturing/non-capturing group magic)?

2 Answers


  1. When creating regular expressions, it is a good idea to think about your expected match boundaries.

    You know you need to match substrings in a longer string, so $ and \z can be excluded at once. Digits, letters, and underscores are all word characters matched by \w, so you want to match everything up to a character other than a word character (or, potentially, up to the end of the string).

    I suggest using

    (label[:+])?(\w+?)(?:_(\d+))?\b
    

    See the regex demo

    Details:

    • (label[:+])? – an optional Group 1: label and then a : or +
    • (\w+?) – Group 2: one or more word chars, as few as possible
    • (?:_(\d+))? – an optional sequence of: _ and then one or more digits captured into Group 3
    • \b – a word boundary: the next char must be a non-word char or the end of the string.
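    As a rough check of the pattern described above, the sketch below runs it with matchAll. The g and i flags, the variable names, and the sample string (borrowed from the snippet in the other answer) are assumptions for illustration.

    ```javascript
    // Suggested pattern, with the backslashes written out.
    const boundedPattern = /(label[:+])?(\w+?)(?:_(\d+))?\b/gi;

    // Sample input borrowed from the other answer's snippet.
    const boundedSample =
      "group_12.label:sub_1.field_23.label+long_field.label:another.label+long_field_345";

    // Keep Group 2 (the name) and Group 3 (the trailing digits, if any).
    const boundedRows = [...boundedSample.matchAll(boundedPattern)]
      .map(m => [m[2], m[3]]);

    console.log(boundedRows);
    // [["group","12"], ["sub","1"], ["field","23"],
    //  ["long_field",undefined], ["another",undefined], ["long_field","345"]]
    ```

    The trailing \b is what forces the lazy \w+? to keep expanding until the whole segment is consumed, so each segment comes back as a single match.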
  2. You can also get the desired result with a simpler regular expression that focuses on the finishing pattern of each group:

    /(\w+?)_(\d+)(?:\.|$)/gi
    

    This may or may not be preceded by a label[+:] group, but that does not need to be expressed in the regular expression.

    See the little demo I modified from Wiktor Stribizew’s example:

    https://regex101.com/r/pO7OdW/1

    Or as a snippet:

    console.log([..."group_12.label:sub_1.field_23.label+long_field.label:another.label+long_field_345".matchAll(/(\w+?)_(\d+)(?:\.|$)/gi)].map(r=>r.slice(1)))