
Context and Explanation

I am writing a Telegram bot, and I want to add the escape character "\" before every "_" character that is not part of a username (a word starting with "@", such as "@username_"). This is to prevent Markdown errors, because Telegram uses "_" to make text italic.

So, for example, having this string:

"hello i like this char _ write me lol_ @myusername_"

I want only the first two "_" characters to be matched, not the third.


Question

What’s the correct way to do this with a regex pattern?


Expected Conditions and Matching

Condition                                                            Match
"_" alone: ("_")                                                     YES
"_" in a word without "@": ("lol_")                                  YES
"_" in a word starting with "@": ("@username_")                      NO
"_" in a word containing "@", after the "@": ("lol@username_")       NO
"_" in a word containing "@", before the "@": ("lol_@username")      YES
"_" in a word like: ("lol_@username_")                               first: YES, second: NO

What I have tried

So far I have arrived at this, but it does not work properly:

"(?=[^@]+)(?:\s[^\s]*(_)[^\s]*\s)"

EDIT

I also want the first "_" in the string "lol_@username_" to be matched.

4

Answers


  1. I assume you only care about @ being at the start of a word. You can use re.sub along with str.replace and the pattern (?:\s|^)[^@]\S*\b to match the words that fit your spec:

    import re
    
    s = "hello i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
    s = re.sub(r"(?:\s|^)[^@]\S*\b", lambda x: x.group().replace("_", r"\_"), s)
    print(s) # => hello i like this char \_ write me lol\_ @myusername_ asd@\_a @_asdf
    

    If you care about @ appearing anywhere in a word, try (?:\s|^)[^@\s]+\b:

    s = "he_llo i like this char _ write me lol_ @myusername_ asd@_a @_asdf"
    s = re.sub(r"(?:\s|^)[^@\s]+\b", lambda x: x.group().replace("_", r"\_"), s)
    print(s) # => he\_llo i like this char \_ write me lol\_ @myusername_ asd@_a @_asdf
    

    Per the OP's comment, it sounds like the latest spec is to escape every _ except those that come after an @ in a word:

    >>> s = "he_llo i lol_@username_ _ write me lol_ @myusername_ asd@_a @_asdf"
    >>> re.sub(r"(?:\s|^)[^@]+@", lambda x: x.group().replace("_", r"\_"), s)
    'he\\_llo i lol\\_@username_ \\_ write me lol\\_ @myusername_ asd@_a @_asdf'
    
  2. Extract with the PyPI regex library:

    import regex
    string = "hello i like this char _ write me lol_ @myusername_"
    print(regex.findall(r'(?<!\S)@\w+(*SKIP)(*F)|_', string))
    # ['_', '_']
    


    Explanation

    --------------------------------------------------------------------------------
      (?<!                     look behind to see if there is not:
    --------------------------------------------------------------------------------
        \S                      non-whitespace (all but \n, \r, \t, \f,
                                 and " ")
    --------------------------------------------------------------------------------
      )                        end of look-behind
    --------------------------------------------------------------------------------
      @                        '@'
    --------------------------------------------------------------------------------
      \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                               more times (matching the most amount possible))
    --------------------------------------------------------------------------------
      (*SKIP)(*F)              skip the match, search from the failure location
    --------------------------------------------------------------------------------
      |                        or
    --------------------------------------------------------------------------------
      _                        a '_' char
    

    Remove with re:

    import re
    string = "hello i like this char _ write me lol_ @myusername_"
    print(re.sub(r'(?<!\S)(@\w+)|_', r'\1', string))
    # hello i like this char  write me lol @myusername_
    


    Replace with re:

    import re
    string = "hello i like this char _ write me lol_ @myusername_"
    print(re.sub(r'(?<!\S)(@\w+)|_', lambda x: x.group(1) or "-", string))
    # hello i like this char - write me lol- @myusername_
    


  3. You could match @ followed by all non-whitespace characters, and use an alternation to capture any other _ in a group. In the callback of re.sub, check whether group 1 exists.

    If it does, return an escaped underscore (or the escaped group 1 value, which is also an underscore); otherwise return the whole match to leave it unchanged.

    @\S+|(_)
    


    import re
    
    strings = [
        "_",
        "lol_",
        "@username_",
        "lol@username_",
        "lol_@username",
        "lol_@username_"
    ]
    
    for s in strings:
        result = re.sub(
            r"@\S+|(_)",
            lambda x: x.group(1).replace("_", r"\_") if x.group(1) else x.group(),
            s
        )
        print(result)
    

    Output

    \_
    lol\_
    @username_
    lol@username_
    lol\_@username
    lol\_@username_
    
  4. Based on @OlvinRoght’s comment, with a small edit, this should do the trick:

    Regex

    ((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)

    Code example

    import re
    
    text = '_hi hello i like this char _ write me lol_ _word something_ @myusername_ something_@username_'
    
    regex = r"((?:^|\s)(?:[^@\s]*?))(_)((?:[^@\s]*?))(?=@|\s|$)"
    
    # Leave the first and last capturing groups as-is and replace the underscore with '\_'
    subst = r"\1\\_\3"
    
    print( re.sub(regex, subst, text) )
    

    Expected output:

    \_hi hello i like this char \_ write me lol\_ \_word something\_ @myusername_ something\_@username_
    


    Note:

    Although this works, @TheFourthBird’s answer is faster (and, I think, more elegant).
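
As a cross-check against the condition table in the question, here is a minimal sketch that wraps the @\S+|(_) pattern from answer 3 in a helper and asserts each row of the table. The names escape_underscores, PATTERN and CASES are made up for illustration and do not come from any of the answers; the expected strings are just the table rows with "\_" wherever the table says YES.

```python
import re

# Pattern from answer 3: consume "@" plus the rest of the word unchanged,
# otherwise capture a single underscore in group 1.
PATTERN = re.compile(r"@\S+|(_)")


def escape_underscores(text):
    """Hypothetical helper: escape each "_" not preceded by "@" in its word."""
    # Group 1 is only set when the second alternative (a bare "_") matched.
    return PATTERN.sub(lambda m: r"\_" if m.group(1) else m.group(), text)


# (input, expected) pairs derived from the question's condition table.
CASES = [
    ("_", r"\_"),
    ("lol_", r"lol\_"),
    ("@username_", "@username_"),
    ("lol@username_", "lol@username_"),
    ("lol_@username", r"lol\_@username"),
    ("lol_@username_", r"lol\_@username_"),
]

for raw, expected in CASES:
    actual = escape_underscores(raw)
    assert actual == expected, (raw, actual, expected)
```

One thing to keep in mind with this sketch: it does not look for existing backslashes, so running it twice on the same text would escape the underscores a second time.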
