I’m using the regex word boundary \b to match a word in the following sentence, but the result is not what I need: connector punctuation characters (such as the underscore) are not treated as word boundaries.
Sentence: ab﹎cd_de_gf|ij|kl|mn|op_
Regexp: \bkl\b
However, de is not getting matched.
I tried updating the regexp to use Unicode connector punctuation (it’s a product requirement, as we support CJK languages as well), but that isn’t working.
Regexp: (?<=\b|[p{Pc}])de(?=\b|[p{Pc}])
What am I missing here?
Note: (?<=\b|_)de(?=\b|_) seems to work for underscores, but I need the regex to work for all connector punctuation characters.
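For reference, here is a minimal sketch that reproduces the behaviour. Python's built-in re module is an assumption (no regex engine is named above). Because re rejects lookbehinds whose alternatives differ in width, (?<=\b|_) is written as an equivalent alternation of assertions. The variable names are illustrative only.

```python
import re

sentence = "ab﹎cd_de_gf|ij|kl|mn|op_"

# "|" is not a word character, so \b matches on both sides of "kl".
kl_hits = re.findall(r"\bkl\b", sentence)

# "_" is a word character, so there is no boundary between "_" and "d".
de_hits = re.findall(r"\bde\b", sentence)

# Accepting "_" as an extra delimiter finds "de", but covers the underscore only.
underscore_hits = re.findall(r"(?:\b|(?<=_))de(?:\b|(?=_))", sentence)

print(kl_hits)          # ['kl']
print(de_hits)          # []
print(underscore_hits)  # ['de']
```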
Thanks in advance !!
2 Answers
To match any connector punctuation character you need \p{Pc}.
NOTE: \p{Pc} can also be written as [_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F], which matches all 10 of these characters.
Based on the use case you have described, you can simplify your regex to:
instead of trying to match word boundaries, Unicode punctuation characters, etc.
This will match de if it is not preceded or followed by any alphanumeric character.
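As a rough illustration of both ideas, here is a sketch in Python. Both patterns are assumptions: SIMPLIFIED is one way to express "not preceded or followed by an alphanumeric character", and EXPLICIT is the lookaround approach from the question with the connector punctuation class spelled out. Python's built-in re module does not support \p{Pc}, hence the explicit class. The names PC_CLASS, SIMPLIFIED and EXPLICIT are hypothetical.

```python
import re

# The ten connector punctuation (Pc) characters from the note above.
# Python's built-in re has no \p{Pc}, so the class is listed explicitly.
PC_CLASS = "[_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F]"

# Assumed pattern: [^\W_] means "word character other than underscore",
# i.e. alphanumeric, so "de" matches only if no alphanumeric touches it.
SIMPLIFIED = re.compile(r"(?<![^\W_])de(?![^\W_])")

# The question's idea: a word boundary or a Pc character on each side.
EXPLICIT = re.compile(rf"(?:\b|(?<={PC_CLASS}))de(?:\b|(?={PC_CLASS}))")

sentence = "ab﹎cd_de_gf|ij|kl|mn|op_"
print(SIMPLIFIED.findall(sentence))  # ['de']
print(EXPLICIT.findall(sentence))    # ['de']
print(SIMPLIFIED.findall("code"))    # []
```

In an engine with Unicode property support (Java, for example), \p{Pc} could replace the spelled-out class. The negative-lookaround form is shorter because it does not need to enumerate delimiters at all.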