I’m making a Twitter API client that gets tweets about a specific word (right now it’s ‘flafel’). Everything is fine except for this tweet:
b'And when I’m thinking about getting the chili sauce on my flafel
and the waitress, a Pinay, tells me not to get it cos “hindi yan
masarap.”\xf0\x9f\x98\x82'
I use print("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))
to see tweets, but this one gives me a UnicodeEncodeError every time. If I remove decode()
from that line, as in print("Tweet info: {}".format(str(tweet.text).encode('utf-8'))),
I can see the actual tweet as shown above, but I want to convert that \xf0\x9f\x98\x82
part to a str. I tried everything, every combination of decode and encode, etc. How can I solve this problem?
Edit: Well, I just went to that user’s Twitter account to see what that non-ASCII part is, and it turns out it’s a smiley: 😂
Is it possible to convert that smiley?
Edit 2: The code is:
...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q="flafel",
                           result_type="recent",
                           include_entities=True,
                           lang="en").items():
    print("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))
2 Answers
The problem could arise at the moment you try to use the Unicode character '\U0001f602' on Windows. Python 3 is fine converting it from UTF-8 to full Unicode and back again, but Windows is not able to display it.

I tried this piece of code in different ways on a Windows 7 box:
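A minimal sketch of such a round-trip test (an illustration only, not necessarily the exact code that was run; the variable names are my own) could look like this. It avoids printing the raw character, so it should run without error even on a console that cannot display it:

```python
# The four UTF-8 bytes seen at the end of the tweet.
utf8_bytes = b'\xf0\x9f\x98\x82'

# Decoding yields a single code point outside the BMP.
char = utf8_bytes.decode('utf-8')
print(ascii(char))      # '\U0001f602'
print(hex(ord(char)))   # 0x1f602

# Encoding it again gives back exactly the same bytes,
# so the UTF-8 <-> str conversion itself is lossless.
round_trip = char.encode('utf-8')
print(round_trip == utf8_bytes)   # True
```

Only a plain print(char) depends on what the console or GUI is able to display.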
And here is what happened:
(for the attentive reader, BMP here means Basic Multilingual Plane)
In a console using the UTF-8 code page (chcp 65001) I got no error but a weird display:
My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like the Windows Tk version does not support this character, nor does any console code page (except for 65001, which simply tries to display the individual UTF-8 bytes!).
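One way to see this without touching the console at all is to try encoding the character with a few code pages. This is only a sketch: the list of code pages is my assumption about what a Windows console might be using, and can_encode is a hypothetical helper, not part of any library.

```python
def can_encode(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

char = '\U0001f602'

# Assumed examples of legacy Windows code pages, plus UTF-8 for comparison.
for codec in ('cp437', 'cp850', 'cp1252', 'utf-8'):
    print(codec, can_encode(char, codec))

# A lossy fallback that is always printable: escape what ASCII cannot hold.
safe = char.encode('ascii', errors='backslashreplace').decode('ascii')
print(safe)
```

The legacy code pages reject the character, which is the same failure print() runs into when the output stream uses one of them.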
TL;DR: The problem is not in core Python processing nor in the UTF-8 converter, but only in the system conversion that is used to display the character '\U0001f602'.
Fortunately, since core Python has no problem with it, you can easily replace the offending '\U0001f602' with a ':D', for example with a simple str.replace (after the code shown above):

If you want special processing for all characters outside the BMP, it is enough to know that the highest code point in the BMP is 0xFFFF. So you could use code like this:

As I mentioned in the comments, you can get the names of Unicode codepoints using the standard unicodedata module. Here’s a small demo:
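The following is a sketch rather than the original listing. The name convert_special is the one used in this thread, but its body here is a reconstruction, and convert_string and the test string are my own assumptions. It shows a fixed str.replace for one known character, and a generic version that uses the 0xFFFF test together with unicodedata.name:

```python
import unicodedata

def convert_special(c):
    """Return c unchanged if it is in the BMP, else a :name: placeholder."""
    if ord(c) > 0xFFFF:
        # unicodedata.name raises ValueError for unnamed code points
        # unless a default is supplied, so fall back to a U+XXXX label.
        name = unicodedata.name(c, 'U+{:04X}'.format(ord(c)))
        return ':{}:'.format(name.lower().replace(' ', '_'))
    return c

def convert_string(s):
    """Apply convert_special to every character of s."""
    return ''.join(convert_special(c) for c in s)

test = 'hindi yan masarap.\U0001f602'

# Fixed replacement for a single known character.
print(test.replace('\U0001f602', ':D'))   # hindi yan masarap.:D

# Generic replacement for anything outside the BMP.
print(convert_string(test))   # hindi yan masarap.:face_with_tears_of_joy:
```

Both results contain only characters that a legacy console code page can display, so printing them should no longer raise UnicodeEncodeError.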
output
Another option is to test if a character is in the Unicode “Symbol_Other” category (abbreviated So). We can do that by replacing the test in convert_special with a check of the character's category, as reported by unicodedata.category. When we do that, we get this output:
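A sketch of that variant, under the same assumptions as before (convert_special is reconstructed, convert_string and the sample string are hypothetical):

```python
import unicodedata

def convert_special(c):
    """Replace any 'Symbol, other' (So) character with a :name: placeholder."""
    if unicodedata.category(c) == 'So':
        name = unicodedata.name(c, 'U+{:04X}'.format(ord(c)))
        return ':{}:'.format(name.lower().replace(' ', '_'))
    return c

def convert_string(s):
    """Apply convert_special to every character of s."""
    return ''.join(convert_special(c) for c in s)

# U+00A9 is inside the BMP, U+1F602 is outside it; both are category So.
sample = '\u00a9 flafel \U0001f602'
print(convert_string(sample))   # :copyright_sign: flafel :face_with_tears_of_joy:
```

Note the trade-off: the category test also rewrites symbols inside the BMP, such as the copyright sign, while leaving non-symbol characters outside the BMP untouched, so it is not equivalent to the 0xFFFF test.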