
I’m using the regex word boundary \b to match a word in the following sentence, but the result is not what I need: connector punctuation characters (such as the underscore) are not treated as word boundaries.

Sentence: ab﹎cd_de_gf|ij|kl|mn|op_

Regexp: \bde\b

However, de is not getting matched.

I tried updating the regexp to use Unicode connector punctuation (it’s a product requirement, as we support CJK languages as well), but that isn’t working.

Regexp: (?<=\b|[p{Pc}])de(?=\b|[p{Pc}])

What am I missing here?

Note: (?<=\b|_)de(?=\b|_) seems to work for underscores, but I need the regex to work for all connector punctuation characters.
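
The two symptoms above can be reproduced with a quick check. This is only a sketch in Python's `re` module, which is an assumption, since the question does not say which regex engine is in use. The sentence is the one from the question, with ﹎ written as `\uFE4E`.

```python
import re

# The sentence from the question; \uFE4E is the connector punctuation "﹎".
sentence = "ab\uFE4Ecd_de_gf|ij|kl|mn|op_"

# "_" is a word character (\w), so there is no \b between "_" and "d".
print(re.search(r"\bde\b", sentence))   # None

# "|" is not a word character, so \b does hold on both sides of "kl".
print(re.search(r"\bkl\b", sentence))   # a match object for "kl"

# Without the backslash, [p{Pc}] is an ordinary character class made of
# the literal characters p, {, P, c and }.
print(re.findall(r"[p{Pc}]", "p{Pc}_"))  # ['p', '{', 'P', 'c', '}']
```

In other words, \b fails next to an underscore because the underscore itself counts as part of a word, and [p{Pc}] never refers to the Unicode category at all.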

Thanks in advance !!

2 Answers


  1. To match any connector punctuation character you need \p{Pc}:

    (?<=\b|\p{Pc})de(?=\b|\p{Pc})
    

    NOTE: \p{Pc} can also be written as [_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F], which matches all 10 of these characters.

  2. Based on the use case you have described you can simplify your regex to:

    (?<![[:alnum:]])de(?![[:alnum:]])
    

    instead of trying to match word boundaries, Unicode punctuation characters, etc.

    This will match de if it is not preceded or followed by any alphanumeric character.

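
Both answers can be tried side by side. The sketch below uses Python, which is an assumption: the thread does not name the regex engine, and Python's standard `re` module supports neither \p{Pc} nor [[:alnum:]] nor a lookbehind whose alternatives differ in width. The connector punctuation class is therefore built from `unicodedata`, the lookbehind is split into `\b` or a fixed-width lookbehind, and `[^\W_]` stands in for an alphanumeric class. The names `PC_CHARS`, `PC_CLASS`, `WITH_PC` and `NOT_ALNUM` are illustrative only.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Pc
# (connector punctuation); this stands in for \p{Pc}.
PC_CHARS = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pc"
)
PC_CLASS = "[" + re.escape(PC_CHARS) + "]"

# Answer 1, adapted: a word boundary OR a connector punctuation
# character on each side of "de".
WITH_PC = re.compile(
    r"(?:\b|(?<=" + PC_CLASS + r"))de(?:\b|(?=" + PC_CLASS + r"))"
)

# Answer 2, adapted: "de" with no alphanumeric character on either side.
# [^\W_] means "a word character other than the underscore".
NOT_ALNUM = re.compile(r"(?<![^\W_])de(?![^\W_])")

sentence = "ab\uFE4Ecd_de_gf|ij|kl|mn|op_"

for pattern in (WITH_PC, NOT_ALNUM):
    match = pattern.search(sentence)
    print(match.span())                                # (6, 8)
    print(pattern.search("model"))                     # None: "de" is inside a word
    print(bool(pattern.search("a\u203Fde\u2040b")))    # True
```

The second pattern is shorter and does not need the list of connector characters, but it is also broader: it accepts any non-alphanumeric neighbour, not only word boundaries and connector punctuation.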