Python 3 UnicodeEncodeError for characters and smileys in Tweets - Twitter API

GLHF
May 30, 2016
142 views
0 votes
2 Answers

I'm building a Twitter API client that fetches tweets containing a specific word (right now it's 'flafel'). Everything works except for this tweet:

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'

I use print("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8'))) to display tweets, but this one raises UnicodeEncodeError every time. If I drop decode() from that line, i.e. print("Tweet info: {}".format(str(tweet.text).encode('utf-8'))), I can see the tweet as the bytes shown above, but I want to convert the \xf0\x9f\x98\x82 part to a str. I have tried everything, every combination of decode and encode. How can I solve this problem?

Edit: I went to that user's Twitter account to see what the non-ASCII part is, and it turns out to be a smiley (😂, U+1F602).

Is it possible to convert that smiley?

Edit 2: Here is the code:

...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q = "flafel",
                           result_type = "recent",
                           include_entities = True,
                           lang = "en").items():

    print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))

Answers

- SergeBallesta
- May 30, 2016 at 5:49 pm
- 0 votes
The problem probably arises when you try to use the Unicode character U+1F602 ('\U0001f602') on Windows. Python 3 has no trouble converting it from UTF-8 to full Unicode and back again, but Windows is not able to display it.

I tried this piece of code in different ways on a Windows 7 box:
```
>>> b = b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'
>>> u = b.decode('utf8')
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
>>> print(u)
```
And here is what happened (BMP here means the Basic Multilingual Plane):
- in IDLE (the Tk-based Python GUI interpreter), I got this error:
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 139-139: Non-BMP character not supported in Tk
- in a console using a non-Unicode code page, I got this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 139: character maps to <undefined>
- in a console using the UTF-8 code page (chcp 65001), I got no error but a weird display:
```
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitr
ess, a Pinay, tells me not to get it cos "hindi yan masarap."ðŸ˜‚'
>>> print(u)
And when I'm thinking about getting the chili sauce on my flafel and the waitres
s, a Pinay, tells me not to get it cos "hindi yan masarap."ðŸ˜‚
>>>
```
My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like neither the Windows Tk version nor any console code page supports this character (except 65001, which simply tries to display the individual UTF-8 bytes!).

TL;DR: The problem is not in core Python processing nor in the UTF-8 codec, but only in the system conversion used to display the character '\U0001f602'.
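To make that concrete, here is a minimal sketch that reproduces the distinction without a Windows console. The helper `can_encode` is a hypothetical name introduced for illustration, and cp1252 and cp437 are only assumed stand-ins for a typical legacy console code page:

```python
# Decode the tail of the tweet from its UTF-8 bytes.
text = b'hindi yan masarap.\xf0\x9f\x98\x82'.decode('utf-8')

# The UTF-8 round trip inside Python is lossless.
assert text.encode('utf-8').decode('utf-8') == text

# The last character is U+1F602, which lies outside the BMP.
print(hex(ord(text[-1])))


def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# UTF-8 can represent the smiley; the legacy code pages cannot, which is
# the same failure that print() hits when it writes to such a console.
for enc in ('utf-8', 'cp1252', 'cp437'):
    print(enc, can_encode(text, enc))
```

Only the output encoding decides whether the string can be written; the string itself is valid throughout.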

Fortunately, since core Python has no problem with it, you can easily replace the offending '\U0001f602' with, for example, ':D' using a plain str.replace (after the code shown above):
```
>>> print(u.replace('\U0001f602', ':D'))
```
```
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":D
```
If you want special processing for all characters outside the BMP, it is enough to know that the highest code point in the BMP is 0xFFFF. You could use code like this:
```
import io

def convert(t):
    """Replace every character outside the BMP with a '!'."""
    with io.StringIO() as fd:
        for c in t:
            fd.write(c if ord(c) < 0x10000 else '!')
        return fd.getvalue()
```
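A different way to get the same effect, sketched here as an alternative rather than as part of the approach above, is to let the codec's error handler do the substitution for whatever encoding the console uses. The function name `console_safe` and its parameters are hypothetical; `errors='backslashreplace'` and `errors='replace'` are standard error handlers for `str.encode`:

```python
import sys


def console_safe(text, encoding=None, errors='backslashreplace'):
    """Return text with characters the target encoding lacks substituted.

    encoding defaults to the encoding of sys.stdout, falling back to ASCII
    if stdout does not report one.
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Encode with a forgiving error handler, then decode back to str.
    return text.encode(encoding, errors=errors).decode(encoding)


# cp1252 is passed explicitly here only to simulate a legacy console.
print(console_safe('masarap.\U0001f602', encoding='cp1252'))
print(console_safe('masarap.\U0001f602', encoding='cp1252', errors='replace'))
```

With 'backslashreplace' the smiley survives as a readable escape sequence, while 'replace' collapses it to a question mark. Unlike the BMP test, this also covers BMP characters that a given code page happens to lack.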

As I mentioned in the comments, you can get the names of Unicode codepoints using the standard unicodedata module. Here’s a small demo:

import unicodedata as ud

test = ('And when I\'m thinking about getting the chili sauce on my flafel and the '
    'waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001F602')

def convert_special(c):
    if c > '\uffff':
        c = ':{}:'.format(ud.name(c).lower().replace(' ', '_'))
    return c

def convert_string(s):
    return ''.join([convert_special(c) for c in s])

for s in (test, 'Some special symbols \U0001F30C, ©, ®, ™, \U0001F40D, \u2323'):
    print('{}\n{}\n'.format(s.encode('unicode-escape'), convert_string(s)))

output

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \\U0001f30c, \\xa9, \\xae, \\u2122, \\U0001f40d, \\u2323'
Some special symbols :milky_way:, ©, ®, ™, :snake:, ⌣

Another option is to test if a character is in the Unicode “Symbol_Other” category. We can do that by replacing the

if c > '\uffff':

test in convert_special with

if ud.category(c) == 'So':

When we do that, we get this output:

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \\U0001f30c, \\xa9, \\xae, \\u2122, \\U0001f40d, \\u2323'
Some special symbols :milky_way:, :copyright_sign:, :registered_sign:, :trade_mark_sign:, :snake:, :smile:
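One caveat with the demo: unicodedata.name raises ValueError for code points that have no name, such as private-use characters. Below is a small sketch of a more defensive variant, assuming you want a visible placeholder in that case. The names `convert_special_safe` and `convert_string_safe` and the `<U+XXXX>` placeholder format are made up for illustration; the optional default argument of `unicodedata.name` is standard:

```python
import unicodedata as ud


def convert_special_safe(c):
    """Like convert_special, but tolerate code points without a name."""
    if c > '\uffff':
        # Passing a default makes name() return it instead of raising.
        name = ud.name(c, None)
        if name is None:
            return '<U+{:04X}>'.format(ord(c))
        return ':{}:'.format(name.lower().replace(' ', '_'))
    return c


def convert_string_safe(s):
    return ''.join(convert_special_safe(c) for c in s)


# U+F0000 is a private-use code point, so it has no Unicode name.
print(convert_string_safe('snake \U0001F40D, private use \U000F0000'))
```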
