I have a Telegram game bot that uses first name and last name pairs to render a leaderboard of the users in a chat, ranked by score. Example screenshot below:

[Screenshot: normal markup]

Every user's name is rendered as a link to their profile. This is the code that generates the link:

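import typing
# Assumption: html_escape is the standard library's html.escape.
from html import escape as html_escape

from telegram import User  # python-telegram-bot

EscapeType = typing.Literal['html']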


def escape_string(s: str, escape: EscapeType | None = None) -> str:
    if escape == 'html':
        s = html_escape(s)
    elif escape is None:
        pass
    else:
        raise NotImplementedError(escape)
    return s


def getter(d):
    if isinstance(d, User):
        return lambda attr: getattr(d, attr, None)
    elif hasattr(d, '__getitem__') and hasattr(d, 'get'):
        return lambda attr: d.get(attr, None)
    else:
        return lambda attr: getattr(d, attr, None)


def personal_appeal(user: User | dict, escape: EscapeType | None = 'html') -> str:
    get = getter(user)

    if full_name := get("full_name"):
        appeal = full_name
    elif name := get("name"):
        appeal = name
    elif first_name := get("first_name"):
        if last_name := get("last_name"):
            appeal = f"{first_name} {last_name}"
        else:
            appeal = first_name
    elif username := get('username'):
        appeal = username
    else:
        raise ValueError(user)

    return escape_string(appeal, escape)


def user_mention(id: int | User, name: str | None = None, escape: EscapeType | None = 'html') -> str:
    if isinstance(id, User):
        user = id
        id = user.id
        # Take the raw name here so that it is escaped exactly once below.
        name = personal_appeal(user, escape=None)

    # Apply the fallback before escaping, because escaping requires a string.
    if name is None:
        name = "N/A"

    name = escape_string(name, escape=escape)

    if id is not None:
        return f'<a href="tg://user?id={id}">{name}</a>'
    else:
        return name

Basically, this code generates a link from a user name – user ID pair. As you can see, the name is HTML escaped by default.

There is, however, one user whose unusual first name somehow breaks this code. Here is the actual sequence of characters they use:

'$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝'

Screenshot of the result of the same code run against this first name:

[Screenshot: bad markup]

As you can see, Telegram seems to lose track of the markup: the link spills over onto unrelated characters, and the <b> tag is broken, too.

This is the actual string being sent to the Telegram servers (except for the IDs, which I redacted):

🔝🏆 <u>Рейтинг игроков чата</u>:

🥇 1. <a href="tg://user?id=1">andy alexanderson</a> (<b>40</b>)
🥈 2. <a href="tg://user?id=2">$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝</a> (<b>40</b>)
🤡 3. <a href="tg://user?id=3">: )</a> (<b>0</b>)

⏱️ <i>Рейтинг составлен 1 минуту назад</i>.
⏭️ <i>Следующее обновление через 28 минут</i>.

Seems like the only odd thing in this markup is the nickname, though.

Is this a Telegram bug?

Can anything be done to mitigate this, so that my users cannot break out of the HTML markup? I am willing to sacrifice the faithful rendering of their names (such users obfuscate their names deliberately), but I need some way to detect input that would break the markup.

Or is there some UTF-16 vs. UTF-8 encoding issue that I am missing?

Framework used: python-telegram-bot.
Python version: 3.10.12.

2 Answers


  1. Chosen as BEST ANSWER

    As @roganjosh pointed out, this turns out to be so-called "Zalgo" text: ordinary characters stacked with many combining diacritical marks. To remove the marks, I used a decode function from an old JS library called lunicode.js, which I found by reverse-engineering a Zalgo-text encoder/decoder website.

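    It turned out to be a very simple function, so here it is in Python. It drops every code point from 768 to 865 (U+0300 to U+0361), a range that lies within the Combining Diacritical Marks block: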

    def remove_zalgo(txt: str) -> str:
        return ''.join([
            char
            for char in txt
            if ord(char) < 768 or ord(char) > 865
        ])
    

    Now my markup doesn't break, and there are no Zalgo characters in the names of my users. I think it's a win :)
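
A possible generalization of the function above, offered as a sketch rather than a drop-in fix: instead of hard-coding one code point range, ask the standard library's `unicodedata` module whether each character is a combining mark, and keep at most a few marks per base character so that ordinary accents survive. The name `limit_combining_marks` and the `max_marks` parameter are hypothetical, and whether this alone keeps Telegram's markup intact is an assumption that has not been verified here.

```python
import unicodedata


def limit_combining_marks(txt: str, max_marks: int = 1) -> str:
    """Keep at most `max_marks` combining marks after each base character.

    Hypothetical helper: a sketch, not a tested fix for the Telegram issue.
    """
    # Compose first, so that "e" + U+0301 becomes the single character U+00E9
    # and is no longer counted as a mark.
    composed = unicodedata.normalize("NFC", txt)
    kept = []
    run = 0  # number of consecutive marks seen since the last base character
    for ch in composed:
        # Mn = nonspacing mark, Me = enclosing mark.
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        kept.append(ch)
    return "".join(kept)
```

With `max_marks=0` this strips every nonspacing and enclosing mark that remains after composition. A small positive value is gentler on scripts that legitimately rely on combining marks.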


  2. You can use the third-party Unidecode package, which transliterates Unicode text to plain ASCII:

    from unidecode import unidecode
    print(unidecode('$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝<'))
    # output:
    # $. ..  <
    

    And with a more meaningful input:

    from unidecode import unidecode
    print(unidecode('ᴮᴵᴳᴮᴵᴿᴰ'))
    # output:
    # BIGBIRD
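
Either sanitizer from the answers could be plugged in ahead of the HTML escaping step. The sketch below shows one way to wire that up using only the standard library. `safe_mention` and its `sanitize` parameter are hypothetical names, `remove_zalgo` is restated from the first answer so the block is self-contained, and the "N/A" fallback mirrors the one in the question's code.

```python
from html import escape
from typing import Callable


def remove_zalgo(txt: str) -> str:
    # Same range as the first answer: drop code points 768..865 (U+0300..U+0361).
    return "".join(ch for ch in txt if not 768 <= ord(ch) <= 865)


def safe_mention(user_id: int, name: str,
                 sanitize: Callable[[str], str] = remove_zalgo) -> str:
    """Build a tg://user link from a sanitized, HTML-escaped name (sketch)."""
    # Sanitize before escaping, so the sanitizer never sees HTML entities.
    # split/join collapses the whitespace left behind by removed marks.
    cleaned = " ".join(sanitize(name).split())
    if not cleaned:
        cleaned = "N/A"  # the name consisted only of removed characters
    return f'<a href="tg://user?id={user_id}">{escape(cleaned)}</a>'
```

Passing `sanitize=unidecode` would swap in the Unidecode approach instead, at the cost of transliterating every non-ASCII name.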
    