Replace series of Unicode characters / Python / Twitter - Twitter-api

TheRedCameron
March 23, 2016
265 views
2 votes
3 Answers

I am pulling text from a tweet using the Twitter API and Python 3.3 and I’m running into the part of the tweet where the tweeter put three symbols in the tweet. They are shown below.

The two flags and the thumbs up seem to be causing the problem. The following is the plain text tweet.

RT @John_Hunt07: Just voted for @marcorubio is Florida! I am ready for a New American Century!! #FLPrimary ud83cuddfaud83cuddf8ud83cuddfaud83cuddf8ud83dudc4d

The following is the code I’m using.

import json
import mysql.connector
import sys
from datetime import datetime
from MySQLCL import MySQLCL

class Functions(object):
"""This is a class for Python functions"""

@staticmethod
def Clean(string):
    temp = str(string)
    temp = temp.replace("'", "").replace("(", "").replace(")", "").replace(",", "").strip()
    return temp

@staticmethod
def ParseTweet(string):
    for x in range(0, len(string)):
        tweetid = string[x]["id_str"]
        tweetcreated = string[x]["created_at"]
        tweettext = string[x]["text"]
        tweetsource = string[x]["source"]
        tweetsource = tweetsource
        truncated = string[x]["truncated"]
        inreplytostatusid = string[x]["in_reply_to_status_id"]
        inreplytouserid = string[x]["in_reply_to_user_id"]
        inreplytoscreenname = string[x]["in_reply_to_screen_name"]
        geo = string[x]["geo"]
        coordinates = string[x]["coordinates"]
        place = string[x]["place"]
        contributors = string[x]["contributors"]
        isquotestatus = string[x]["is_quote_status"]
        retweetcount = string[x]["retweet_count"]
        favoritecount = string[x]["favorite_count"]
        favorited = string[x]["favorited"]
        retweeted = string[x]["retweeted"]
        if "possibly_sensitive" in string[x]:
            possiblysensitive = string[x]["possibly_sensitive"]
        else:
            possiblysensitive = ""
        language = string[x]["lang"]

        #print(possiblysensitive)
        print(Functions.UnicodeFilter(tweettext))
        #print(inreplytouserid)
        #print("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + Functions.UnicodeFilter(tweettext) + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + str(language) + "', '" + Functions.ToSQL(tweetcreated) + "', '" + Functions.ToSQL(tweetsource) + "', " + str(possiblysensitive) + ")")
        #MySQLCL.Set("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + tweettext + "', " + str(truncated) + ", " + Functions.CheckNullNum(inreplytostatusid) + ", " + Functions.CheckNullNum(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + language + "', '" + str(Functions.FormatDate(tweetcreated)) + "', '" + str(Functions.UnicodeFilter(tweetsource)) + "', " + str(possiblysensitive) + ")")

@staticmethod
def ToBool(variable):
    if variable.lower() == 'true':
        return True
    elif variable.lower() == 'false':
        return False

@staticmethod
def CheckNullNum(var):
    if var == None:
        return "0"
    else:
        return str(var)

@staticmethod
def CheckNull(var):
    if var == None:
        return ""
    else:
        return var

@staticmethod
def ToSQL(var):
    temp = var
    temp = temp.replace("'", "")
    return str(temp)

@staticmethod
def UnicodeFilter(var):
    temp = var
    temp = temp.replace(chr(0x2019), "")
    temp = temp.replace(chr(0x003c), "(lessthan)")
    temp = temp.replace(chr(0x003e), "(greaterthan)")
    temp = temp.replace(chr(0xd83c), "")
    temp = temp.replace(chr(0xddfa), "")
    temp = temp.replace(chr(0xddf8), "")
    temp = temp.replace(chr(0xd83d), "")
    temp = temp.replace(chr(0xdc4d), "")
    temp = Functions.ToSQL(temp)
    return temp

@staticmethod
def FormatDate(var):
    temp = var
    dt = datetime.strptime(temp, "%a %b %d %H:%M:%S %z %Y")
    retdt = str(dt.year) + "-" + str(dt.month) + "-" + str(dt.day) + "T" + str(dt.hour) + ":" + str(dt.minute) + ":" + str(dt.second)
    return retdt

As you can see, I’ve been using the function UnicodeFilter in order to try to filter out the unicode characters in hex. The function works when dealing with single unicode characters, but when encountering multiple unicode characters placed together, this method fails and gives the following error:

‘charmap’ codec can’t encode characters in position 107-111: character maps to ‘undefined’

Do any of you have any ideas about how to get past this problem?

UPDATE: I have tried Andrew Godbehere’s solution and I was still running into the same issues. However, I decided to see if there were any specific characters that were causing a problem, so I decided to print the characters to the console character by character. That gave me the error as follows:

‘charmap’ codec can’t encode character ‘U0001f1fa’ in position 0: character maps to ‘undefined’

Upon seeing this, I added this to the UnicodeFilter function and continued testing. I have run into multiple errors of the same kind while printing the tweets character by character. However, I don’t want to have to keep making these exceptions. For example, see the revised UnicodeFilter function:

@staticmethod
def UnicodeFilter(var):
    temp = var
    temp = temp.encode(errors='ignore').decode('utf-8')
    temp = temp.replace(chr(0x2019), "")
    temp = temp.replace(chr(0x003c), "(lessthan)")
    temp = temp.replace(chr(0x003e), "(greaterthan)")
    temp = temp.replace(chr(0xd83c), "")
    temp = temp.replace(chr(0xddfa), "")
    temp = temp.replace(chr(0xddf8), "")
    temp = temp.replace(chr(0xd83d), "")
    temp = temp.replace(chr(0xdc4d), "")
    temp = temp.replace(chr(0x2026), "")
    temp = temp.replace(u"U0001F1FA", "")
    temp = temp.replace(u"U0001F1F8", "")
    temp = temp.replace(u"U0001F44D", "")
    temp = temp.replace(u"U00014F18", "")
    temp = temp.replace(u"U0001F418", "")
    temp = temp.replace(u"U0001F918", "")
    temp = temp.replace(u"U0001F3FD", "")
    temp = temp.replace(u"U0001F195", "")
    temp = Functions.ToSQL(temp)
    return str(temp)

I don’t want to have to add a new line for every character that causes a problem. Through this method, I have been able to pass multiple tweets, but this issue resurfaces with every tweet that contains different symbols. Is there not a solution that will filter out all these characters? Is it possible to filter out all characters not in the utf-8 character set?

Answers

Chosen as BEST ANSWER
- TheRedCameron
- June 15, 2016 at 4:32 am
- 0 votes
0
Found the answer. The issue was that there was a range of characters in the tweets that were causing problems. Once I found the correct Unicode range for the characters, I implemented the for loop to replace any occurrence of any Unicode character within that range. After implementing that, I was able to pull thousands of tweets without any formatting or MySQL errors at all.
```
@staticmethod
def UnicodeFilter(var):
    temp = var
    temp = temp.replace(chr(0x2019), "'")
    temp = temp.replace(chr(0x2026), "")
    for x in range(127381, 129305):
        temp = temp.replace(chr(x), "")
    temp = MySQLCL.ToSQL(temp)
    return str(temp)
```

(Edit)

- AndrewGodbehere
- March 23, 2016 at 4:49 am
- 0 votes
0
Try the built-in unicode encode/decode error handling functionality: str.encode(errors='ignore')

For example:
```
problem_string = """
RT @John_Hunt07: Just voted for @marcorubio is Florida! I am ready for a New American Century!! #FLPrimary ud83cuddfaud83cuddf8ud83cuddfaud83cuddf8ud83dudc4d
"""
print(problem_string.encode(errors='ignore').decode('utf-8'))
```
Ignoring errors removes problematic characters.
```
> RT @John_Hunt07: Just voted for @marcorubio is Florida! I am ready for a New American Century!! #FLPrimary 
```
Other error handling options may be of interest.
xmlcharrefreplace for instance would yield:

> RT @John_Hunt07: Just voted for @marcorubio is Florida! I am ready for a New American Century!! #FLPrimary &#55356;&#56826;&#55356;&#56824;&#55356;&#56826;&#55356;&#56824;&#55357;&#56397;

If you require custom filtering as implied by your UnicodeFilter function, see Python documentation on registering an error handler.
Login or Signup to reply.

- AlastairMcCormack
- March 24, 2016 at 10:51 pm
- 0 votes
0
Python provides a useful stacktrace so you can tell where errors are coming from.
Using it, you will have found that your print is causing the exception.

print() is failing because you’re running Python from the Windows console, which, by default only, supports your local 8bit charmap. You can add support with: https://github.com/Drekin/win-unicode-console

You can also just write your data straight to a text file. Open the file with:
```
open('output.txt', 'w', encoding='utf-8')
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Replace series of Unicode characters / Python / Twitter – Twitter-api

Answers