I am pulling text from a tweet using the Twitter API and Python 3.3, and I’m running into trouble with the part of the tweet where the tweeter put three symbols.
The two flags and the thumbs-up seem to be causing the problem. The following is the plain-text tweet.
RT @John_Hunt07: Just voted for @marcorubio is Florida! I am ready for a New American Century!! #FLPrimary \ud83c\uddfa\ud83c\uddf8\ud83c\uddfa\ud83c\uddf8\ud83d\udc4d
The following is the code I’m using.
import json
import mysql.connector
import sys
from datetime import datetime
from MySQLCL import MySQLCL

class Functions(object):
    """This is a class for Python functions"""

    @staticmethod
    def Clean(string):
        temp = str(string)
        temp = temp.replace("'", "").replace("(", "").replace(")", "").replace(",", "").strip()
        return temp

    @staticmethod
    def ParseTweet(string):
        for x in range(0, len(string)):
            tweetid = string[x]["id_str"]
            tweetcreated = string[x]["created_at"]
            tweettext = string[x]["text"]
            tweetsource = string[x]["source"]
            tweetsource = tweetsource
            truncated = string[x]["truncated"]
            inreplytostatusid = string[x]["in_reply_to_status_id"]
            inreplytouserid = string[x]["in_reply_to_user_id"]
            inreplytoscreenname = string[x]["in_reply_to_screen_name"]
            geo = string[x]["geo"]
            coordinates = string[x]["coordinates"]
            place = string[x]["place"]
            contributors = string[x]["contributors"]
            isquotestatus = string[x]["is_quote_status"]
            retweetcount = string[x]["retweet_count"]
            favoritecount = string[x]["favorite_count"]
            favorited = string[x]["favorited"]
            retweeted = string[x]["retweeted"]
            if "possibly_sensitive" in string[x]:
                possiblysensitive = string[x]["possibly_sensitive"]
            else:
                possiblysensitive = ""
            language = string[x]["lang"]
            #print(possiblysensitive)
            print(Functions.UnicodeFilter(tweettext))
            #print(inreplytouserid)
            #print("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + Functions.UnicodeFilter(tweettext) + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + str(language) + "', '" + Functions.ToSQL(tweetcreated) + "', '" + Functions.ToSQL(tweetsource) + "', " + str(possiblysensitive) + ")")
            #MySQLCL.Set("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + tweettext + "', " + str(truncated) + ", " + Functions.CheckNullNum(inreplytostatusid) + ", " + Functions.CheckNullNum(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + language + "', '" + str(Functions.FormatDate(tweetcreated)) + "', '" + str(Functions.UnicodeFilter(tweetsource)) + "', " + str(possiblysensitive) + ")")

    @staticmethod
    def ToBool(variable):
        if variable.lower() == 'true':
            return True
        elif variable.lower() == 'false':
            return False

    @staticmethod
    def CheckNullNum(var):
        if var == None:
            return "0"
        else:
            return str(var)

    @staticmethod
    def CheckNull(var):
        if var == None:
            return ""
        else:
            return var

    @staticmethod
    def ToSQL(var):
        temp = var
        temp = temp.replace("'", "")
        return str(temp)

    @staticmethod
    def UnicodeFilter(var):
        temp = var
        temp = temp.replace(chr(0x2019), "")
        temp = temp.replace(chr(0x003c), "(lessthan)")
        temp = temp.replace(chr(0x003e), "(greaterthan)")
        temp = temp.replace(chr(0xd83c), "")
        temp = temp.replace(chr(0xddfa), "")
        temp = temp.replace(chr(0xddf8), "")
        temp = temp.replace(chr(0xd83d), "")
        temp = temp.replace(chr(0xdc4d), "")
        temp = Functions.ToSQL(temp)
        return temp

    @staticmethod
    def FormatDate(var):
        temp = var
        dt = datetime.strptime(temp, "%a %b %d %H:%M:%S %z %Y")
        retdt = str(dt.year) + "-" + str(dt.month) + "-" + str(dt.day) + "T" + str(dt.hour) + ":" + str(dt.minute) + ":" + str(dt.second)
        return retdt
As you can see, I’ve been using the function UnicodeFilter to try to filter out the Unicode characters by their hex code points. The function works when dealing with single Unicode characters, but when it encounters several placed together, it fails with the following error:
'charmap' codec can't encode characters in position 107-111: character maps to <undefined>
Do any of you have any ideas about how to get past this problem?
UPDATE: I tried Andrew Godbehere’s solution and still ran into the same issue. To see whether specific characters were causing the problem, I printed the tweet to the console character by character. That gave me the following error:
'charmap' codec can't encode character '\U0001f1fa' in position 0: character maps to <undefined>
Upon seeing this, I added this to the UnicodeFilter function and continued testing. I have run into multiple errors of the same kind while printing the tweets character by character. However, I don’t want to have to keep making these exceptions. For example, see the revised UnicodeFilter function:
    @staticmethod
    def UnicodeFilter(var):
        temp = var
        temp = temp.encode(errors='ignore').decode('utf-8')
        temp = temp.replace(chr(0x2019), "")
        temp = temp.replace(chr(0x003c), "(lessthan)")
        temp = temp.replace(chr(0x003e), "(greaterthan)")
        temp = temp.replace(chr(0xd83c), "")
        temp = temp.replace(chr(0xddfa), "")
        temp = temp.replace(chr(0xddf8), "")
        temp = temp.replace(chr(0xd83d), "")
        temp = temp.replace(chr(0xdc4d), "")
        temp = temp.replace(chr(0x2026), "")
        temp = temp.replace(u"\U0001F1FA", "")
        temp = temp.replace(u"\U0001F1F8", "")
        temp = temp.replace(u"\U0001F44D", "")
        temp = temp.replace(u"\U00014F18", "")
        temp = temp.replace(u"\U0001F418", "")
        temp = temp.replace(u"\U0001F918", "")
        temp = temp.replace(u"\U0001F3FD", "")
        temp = temp.replace(u"\U0001F195", "")
        temp = Functions.ToSQL(temp)
        return str(temp)
I don’t want to have to add a new line for every character that causes a problem. Through this method, I have been able to pass multiple tweets, but this issue resurfaces with every tweet that contains different symbols. Is there not a solution that will filter out all these characters? Is it possible to filter out all characters not in the utf-8 character set?
3 Answers
Found the answer. The issue was a range of characters in the tweets. Once I found the correct Unicode range for those characters, I implemented a for loop to replace any occurrence of any Unicode character within that range. After that, I was able to pull thousands of tweets without any formatting or MySQL errors at all.
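The answer does not say which range was used or show the loop. A minimal sketch, assuming the troublesome characters are the ones above U+FFFF (where the flag and thumbs-up emoji live); the function names are hypothetical, and a regular expression stands in for the for loop described above:

```python
import re

# Assumption: the problematic range is everything above the Basic
# Multilingual Plane. The answer does not state the range it used.
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def strip_astral(text):
    """Remove every code point above U+FFFF in one pass."""
    return ASTRAL.sub("", text)


def strip_range(text, low, high):
    """Remove every character whose code point lies in [low, high]."""
    return "".join(ch for ch in text if not (low <= ord(ch) <= high))


tweet = "#FLPrimary \U0001F1FA\U0001F1F8\U0001F44D"
without_emoji = strip_astral(tweet)

# Lone surrogates (as in the escaped tweet text) sit in U+D800..U+DFFF.
without_surrogates = strip_range("x\ud83cy", 0xD800, 0xDFFF)
```

If the target column uses MySQL's legacy three-byte utf8 character set, four-byte characters like these cannot be stored, which may be why removing them also cleared the MySQL errors; a utf8mb4 column is the alternative that keeps the emoji.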
Try the built-in Unicode encode/decode error handling functionality: str.encode(errors='ignore'). For example:
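A minimal sketch of that call, using a shortened version of the tweet from the question. The variable names are illustrative, and the surrogate escapes are assumed to arrive exactly as printed in the question:

```python
# Shortened tweet text with the escapes as shown in the question; in a
# Python 3 literal these are lone surrogate code points.
tweet = "Century!! #FLPrimary \ud83c\uddfa\ud83c\uddf8\ud83d\udc4d"

# Lone surrogates cannot be encoded as UTF-8, so errors="ignore" drops them.
cleaned = tweet.encode(errors="ignore").decode("utf-8")

# A fully decoded emoji is valid UTF-8 and survives the default encoding...
thumbs_up = "ok \U0001F44D"
still_there = thumbs_up.encode(errors="ignore").decode("utf-8")

# ...but an ASCII target (an assumption, not part of the answer) removes it.
ascii_only = thumbs_up.encode("ascii", errors="ignore").decode("ascii")
```

The second half matters when the text comes from a JSON parser, which typically joins each surrogate pair into a single code point: with the default UTF-8 target nothing is filtered, so a narrower target encoding such as ASCII is needed to strip the emoji.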
Ignoring errors removes problematic characters.
Other error handling options may be of interest; xmlcharrefreplace, for instance, would yield:
> RT @John_Hunt07: Just voted for @marcorubio is Florida! I am ready for a New American Century!! #FLPrimary &#55356;&#56826;&#55356;&#56824;&#55356;&#56826;&#55356;&#56824;&#55357;&#56397;
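To compare the options side by side, here is a small sketch that runs a few built-in handlers over the same text and, as one hedged illustration of custom filtering, registers a handler through codecs.register_error. ASCII as the target encoding, the sample string, and the handler name emoji_placeholder are all assumptions made for this example:

```python
import codecs

sample = "Century!! \U0001F1FA\U0001F1F8\U0001F44D"

# Built-in handlers, with ASCII as the (assumed) target encoding.
ignored = sample.encode("ascii", errors="ignore").decode("ascii")
replaced = sample.encode("ascii", errors="replace").decode("ascii")
xmlrefs = sample.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
escaped = sample.encode("ascii", errors="backslashreplace").decode("ascii")


def emoji_placeholder(error):
    """Hypothetical handler: emit one '[?]' per unencodable character."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    count = error.end - error.start
    # Return the replacement text and the position at which to resume.
    return ("[?]" * count, error.end)


codecs.register_error("emoji_placeholder", emoji_placeholder)
custom = sample.encode("ascii", errors="emoji_placeholder").decode("ascii")
```

With "ignore" the symbols vanish, "replace" leaves one question mark per character, and the other two keep enough information to reconstruct the original code points later.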
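If you require custom filtering as implied by your UnicodeFilter function, see the Python documentation on registering an error handler.

Python provides a useful stack trace so you can tell where errors are coming from. Using it, you would have found that your print is causing the exception. print() is failing because you’re running Python from the Windows console, which by default supports only your local 8-bit charmap. You can add support with: https://github.com/Drekin/win-unicode-console

You can also just write your data straight to a text file. Open the file with: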
is failing because you’re running Python from the Windows console, which, by default only, supports your local 8bit charmap. You can add support with: https://github.com/Drekin/win-unicode-consoleYou can also just write your data straight to a text file. Open the file with: