Regex match "_" char only if it isn't in a username - Telegram API

LeonardoScotti
January 17, 2021
214 views
3 votes
4 Answers

Context and Explanation

I am building a Telegram bot, and I want to add the escape character "\" before every "_" that is not part of a username (a word starting with "@", such as "@username_"). This prevents Markdown errors, because Telegram uses "_" to mark italic text.

So, for example, having this string:

"hello i like this char _ write me lol_ @myusername_"

I want only the first two "_" characters to be matched, not the third.

Question

What's the correct way to do this with a regex pattern?

Expected Conditions and Matching

Condition	Match
`"_"` alone: (`"_"`)	YES
`"_"` in a word without `"@"`: (`"lol_"`)	YES
`"_"` in a word starting with `"@"`: (`"@username_"`)	NO
`"_"` in a word containing `"@"` after the `"@"`: (`"lol@username_"`)	NO
`"_"` in a word containing `"@"` before the `"@"`: (`"lol_@username"`)	YES
`"_"` in a word like: (`"lol_@username_"`)	first: YES second: NO

What I have tried

So far I have arrived at this, but it does not work properly:

"(?=[^@]+)(?:\s[^\s]*(_)[^\s]*\s)"

EDIT

I also want the first "_" in the string "lol_@username_" to be matched.

Answers

I assume you only care about @ being at the start of a word. You can use re.sub along with replace and (?:\s|^)[^@]\S*\b to match the words that fit your spec:

import re

s = "hello i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
s = re.sub(r"(?:\s|^)[^@]\S*\b", lambda x: x.group().replace("_", r"\_"), s)
print(s) # => hello i like this char \_ write me lol\_ @myusername_ asd@\_a @_asdf

If you care about @ appearing anywhere in a word, try (?:\s|^)[^@\s]+\b:

s = "he_llo i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
s = re.sub(r"(?:\s|^)[^@\s]+\b", lambda x: x.group().replace("_", r"\_"), s)
print(s) # => he\_llo i like this char \_ write me lol\_ @myusername_ asd@_a @_asdf

Per OP comment, sounds like the latest spec is to escape _ that are anywhere except after @ in a word:

>>> s = "he_llo i lol_@username_ _ write me lol_ @myusername_ asd@_a @_asdf"
>>> re.sub(r"(?:\s|^)[^@]+@", lambda x: x.group().replace("_", r"\_"), s)
'he\\_llo i lol\\_@username_ \\_ write me lol\\_ @myusername_ asd@_a @_asdf'
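The same rule (escape "_" everywhere except after an "@" in the same word) can also be sketched word by word, which avoids depending on a later "@" being present in the text. This is only an illustration, not the answer's own method; the function name `escape_outside_usernames` and the extra `trailing_word` in the sample are assumptions added for the example.

```python
import re

def escape_outside_usernames(text):
    """Escape "_" with a backslash unless it comes after an "@" in the same word."""
    def fix_word(match):
        word = match.group()
        # Split at the first "@": only the part before it gets escaped.
        head, at, tail = word.partition("@")
        return head.replace("_", r"\_") + at + tail

    # \S+ visits each whitespace-delimited word; the whitespace itself is untouched.
    return re.sub(r"\S+", fix_word, text)

s = "he_llo i lol_@username_ _ write me lol_ @myusername_ asd@_a @_asdf trailing_word"
print(escape_outside_usernames(s))
# he\_llo i lol\_@username_ \_ write me lol\_ @myusername_ asd@_a @_asdf trailing\_word
```

Using `str.partition` keeps the "before the @" and "after the @" halves explicit, so the regex only has to find words.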

Extract with PyPi regex library:

import regex
string = "hello i like this char _ write me lol_ @myusername_"
print(regex.findall(r'(?<!\S)@\w+(*SKIP)(*F)|_', string))
# ['_', '_']

See Python proof.

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                      non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  @                        '@'
--------------------------------------------------------------------------------
  \w+                     word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount  possible))
--------------------------------------------------------------------------------
  (*SKIP)(*F)              skip the match, search from the failure location
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  _                        a '_' char

Remove with re:

import re
string = "hello i like this char _ write me lol_ @myusername_"
print(re.sub(r'(?<!\S)(@\w+)|_', r'\1', string))
# hello i like this char  write me lol @myusername_

See Python proof.

Replace with re:

import re
string = "hello i like this char _ write me lol_ @myusername_"
print(re.sub(r'(?<!\S)(@\w+)|_', lambda x: x.group(1) or "-", string))
# hello i like this char - write me lol- @myusername_

See another Python proof.
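To get the escaping the question asks for, the "Replace with re" callback can return a backslash-escaped underscore instead of "-". A minimal sketch using only the standard library; the names `PATTERN` and `escape_underscores` are made up for this example. Like the patterns above, it only protects usernames whose "@" starts a word.

```python
import re

# Group 1 captures a whole "@username" that starts a word; otherwise a bare "_" matches.
PATTERN = re.compile(r"(?<!\S)(@\w+)|_")

def escape_underscores(text):
    # Put usernames back unchanged; replace any other "_" with "\_".
    return PATTERN.sub(lambda m: m.group(1) or r"\_", text)

string = "hello i like this char _ write me lol_ @myusername_"
print(escape_underscores(string))
# hello i like this char \_ write me lol\_ @myusername_
```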

- Thefourthbird
- January 18, 2021 at 10:53 am
- 0 votes
You could match all non-whitespace characters after an @, and use an alternation to capture a standalone _ in group 1. In the callback of re.sub, check whether group 1 exists.

If it does, return the escaped group 1 value (which is always an underscore); otherwise return the whole match to leave it unchanged.
```
@\S+|(_)
```
Regex demo
```
import re

strings = [
    "_",
    "lol_",
    "@username_",
    "lol@username_",
    "lol_@username",
    "lol_@username_"
]

for s in strings:
    result = re.sub(
        r"@\S+|(_)",
        lambda x: x.group(1).replace("_", r"\_") if x.group(1) else x.group(),
        s
    )
    print(result)
```
Output
```
\_
lol\_
@username_
lol@username_
lol\_@username
lol\_@username_
```
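For use inside a bot, the same pattern could be wrapped in a small helper and checked against the cases from the question's table. This is a sketch only; `USERNAME_OR_UNDERSCORE`, `escape_markdown_underscores`, and the `cases` mapping are names invented for the example.

```python
import re

# "@" followed by non-whitespace is consumed as-is; a bare "_" lands in group 1.
USERNAME_OR_UNDERSCORE = re.compile(r"@\S+|(_)")

def escape_markdown_underscores(text):
    # Group 1 is set only when a bare "_" matched; an "@..." run is returned unchanged.
    return USERNAME_OR_UNDERSCORE.sub(
        lambda m: r"\_" if m.group(1) else m.group(), text
    )

# Cases taken from the question's table: input -> expected result.
cases = {
    "_": r"\_",
    "lol_": r"lol\_",
    "@username_": "@username_",
    "lol@username_": "lol@username_",
    "lol_@username": r"lol\_@username",
    "lol_@username_": r"lol\_@username_",
}

for raw, expected in cases.items():
    assert escape_markdown_underscores(raw) == expected, raw
print("all cases pass")
```

Compiling the pattern once avoids re-parsing it on every message.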

- Philip
- January 18, 2021 at 12:29 pm
- 0 votes
Based on @OlvinRoght’s comment, with a small edit, this should do the trick:

Regex

((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)

Code example
```
import re

text = '_hi hello i like this char _ write me lol_ _word something_ @myusername_ something_@username_'

regex = r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)"

# Keep the first and last capturing groups as-is and replace the underscore with '\_'
subst = r"\1\\_\3"

print( re.sub(regex, subst, text) )
```
Expected output:
```
\_hi hello i like this char \_ write me lol\_ \_word something\_ @myusername_ something\_@username_
```
Demo

See it live

Note:

Although this works, @TheFourthBird’s answer is faster. (And more elegant I think.)
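The speed difference can be checked locally with `timeit`. The following is a rough sketch, not a rigorous benchmark: the function names are made up for the example, the run count is arbitrary, and timings will vary by machine and Python version. The two patterns agree on this sample text, which is not guaranteed for every input.

```python
import re
import timeit

text = ("_hi hello i like this char _ write me lol_ _word something_ "
        "@myusername_ something_@username_")

alternation = re.compile(r"@\S+|(_)")
grouped = re.compile(r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)")

def with_alternation(s):
    # Callback approach: escape only when the bare "_" group matched.
    return alternation.sub(lambda m: r"\_" if m.group(1) else m.group(), s)

def with_groups(s):
    # Template approach: keep groups 1 and 3, insert "\_" between them.
    return grouped.sub(r"\1\\_\3", s)

for fn in (with_alternation, with_groups):
    seconds = timeit.timeit(lambda: fn(text), number=10_000)
    print(f"{fn.__name__}: {seconds:.4f}s for 10,000 runs")
```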