Python unicode string - position - Telegram API

user3352603
May 3, 2020
196 views
0 votes
2 Answers

I am stuck on getting a position inside a string.
I read the content of a file:

with io.open(testfile, 'r', encoding='utf-8') as f:

\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING

What do I have to do so that “\u2705” is counted as one character?
Then position 36 would be the start of @GET_THIS_STING.

–== EDIT ==–
I can now show the problem more clearly:

import json
from io import open

line = '{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\\n@GET_THIS_STING\\n123456789","entities":[{"offset":36,"length":26,"type":"mention"}]}}'
myjson = json.loads(line)
text = myjson.get("message", {}).get("text", None)
print(str(text).encode('utf-8', 'replace').decode())
print("string length: " + str(len(text)))
print(text[36:36+15])

print("-------------")

with open("/home/pi/telegram/phpLogs/test.txt", 'r', encoding='utf-8', errors="surrogateescape") as f:
    for line in f:
        myjson = json.loads(line)

        text = myjson.get("message", {}).get("text", None)
        print(text)
        print("string length: " + str(len(text)))
        print(text[36:36+15])

RESULT:

✅ Offizielle Kanäle ????  ???? ????
@GET_THIS_STING
123456789
string length: 61
@GET_THIS_STING
-------------
✅ Offizielle Kanäle 🇩🇪  🇦🇹 🇨🇭
@GET_THIS_STING123456789
string length: 54
HIS_STING123456

So when I have the string inside my code (UTF-8) as a variable (string), everything works fine.
But when I create a file with the following content and read it

{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING\n123456789","entities":[{"offset":36,"length":26,"type":"mention"}]}}

I always receive a “wrong” result 🙁
So reading the file is my problem, because the strings are not the same afterwards – even the length is different!

Answers

If your file text.txt literally contains,

\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING

Try:

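with open('text.txt', 'r', encoding='utf-8') as f:
    raw = f.read()  # named raw to avoid shadowing the built-in str
    normal_str = ''
    i = 0
    while i < len(raw):
        if raw[i: i + 2] == '\\u':
            # a literal \uXXXX escape (6 characters) counts as one position
            i += 6
            normal_str += 'x'
        elif raw[i: i + 2] == '\\n':
            # a literal \n escape (2 characters) counts as one position
            i += 2
            normal_str += 'x'
        else:
            normal_str += raw[i]
            i += 1
    print(normal_str)
    print(normal_str[36:36 + 15])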

And, this outputs:

x Offizielle Kanxle xxxx  xxxx xxxxx@GET_THIS_STING

@GET_THIS_STING
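
The same placeholder substitution can be sketched with a regular expression instead of a manual loop. This is only an illustration: the sample text is hard-coded as a raw string (an assumption standing in for the file content), and the variable names are mine.

```python
import re

# Assumed sample: the literal escaped text from the question, kept raw so
# the backslashes stay as characters (as they would when read from the file).
raw = r'\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING'

# Replace each literal \uXXXX escape or literal \n with one placeholder 'x',
# so every escape occupies exactly one position.
normal_str = re.sub(r'\\u[0-9a-fA-F]{4}|\\n', 'x', raw)

print(normal_str)
print(normal_str[36:36 + 15])
```

Because every `\uXXXX` escape stands for one UTF-16 code unit, the placeholder string has the same positions that the offset 36 refers to.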

With a file text.txt that looks something like this,

✅ Offizielle Kanäle 🇩🇪  🇦🇹 🇨🇭
@GET_THIS_STING

We can do,

with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # named text to avoid shadowing the built-in str
    index = text.find('@')
    print('char @ is at index: {}'.format(index))
    print(text[index:])

It outputs,

char @ is at index: 30
@GET_THIS_STING
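
The index 30 found here differs from the offset 36 in the question because Python counts code points, while each flag letter takes two UTF-16 code units. A small sketch of the conversion, assuming the text is already decoded (the helper name `utf16_offset` is hypothetical):

```python
def utf16_offset(text, index):
    """Return how many UTF-16 code units precede text[index]."""
    # utf-16-le uses 2 bytes per code unit and writes no byte order mark.
    return len(text[:index].encode('utf-16-le')) // 2


# Assumed sample: the decoded text from the question.
text = ('\u2705 Offizielle Kan\u00e4le \U0001F1E9\U0001F1EA  '
        '\U0001F1E6\U0001F1F9 \U0001F1E8\U0001F1ED\n@GET_THIS_STING')

index = text.find('@')               # position in code points
offset = utf16_offset(text, index)   # position in UTF-16 code units
print(index, offset)
```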

- Botosmtek
- May 3, 2020 at 11:52 am
- 0 votes
If this string represents ✅ Offizielle Kanäle 🇩🇪 🇦🇹 🇨🇭, as suggested by @scribe’s answer, then I think you have run into the problem mentioned here: Converting to Emoji

Therefore I suggest replacing
```
with io.open(testfile, 'r', encoding='utf-8') as f:
    text = f.read() # you didn't show it but probably that's what you have done
```
with
```
with open(testfile, 'r', encoding='ascii') as f:
    text = json.load(f)
```
or, if the file is “JSON lines” rather than single JSON:
```
with open(testfile, 'r', encoding='ascii') as f:
    for line in f:
        text = json.loads(line)
```
and then text will be a proper Unicode string, so text[36:] should get you what you asked for.
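
One caveat: the Telegram Bot API documents entity offsets and lengths in UTF-16 code units, so slicing a fully decoded Python string by those numbers can land in the wrong place once emoji are involved. A minimal sketch of slicing by UTF-16 code units, assuming a one-line JSON sample and a hypothetical helper name `entity_text`:

```python
import json


def entity_text(text, offset, length):
    """Slice text by offset/length counted in UTF-16 code units."""
    # utf-16-le uses 2 bytes per code unit; surrogatepass tolerates lone
    # surrogates if a slice boundary splits a pair.
    units = text.encode('utf-16-le', 'surrogatepass')
    chunk = units[2 * offset:2 * (offset + length)]
    return chunk.decode('utf-16-le', 'surrogatepass')


# Assumed sample: raw string, so the \u escapes reach json.loads untouched.
line = r'{"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING"}'
text = json.loads(line)["text"]

mention = entity_text(text, 36, 15)
print(len(text))   # counts code points, not UTF-16 code units
print(mention)
```

Here `len(text)` is smaller than the UTF-16 length because each flag letter is one code point but two code units, which is why a plain `text[36:]` would start too late.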