
I’m pulling Twitter data via their API, and one of the tweets contains a special character (the right single quotation mark, U+2019). I keep getting an error saying that Python can’t map the character to the target encoding. I’ve searched extensively, but I have yet to find a solution. I just want to replace that character with either an apostrophe that Python will recognize or an empty string (essentially removing it). I’m using Python 3.3. Any input on how to fix this problem? It may seem simple, but I’m a newbie at Python.

Edit: Here is the function I’m using to try to filter out the unicode characters that throw errors.

@staticmethod
def UnicodeFilter(var):
    temp = var
    temp = temp.replace(chr(2019), "'")
    temp = Functions.ToSQL(temp)
    return temp

Also, when running the program, my error is as follows.

'charmap' codec can't encode character '\u2019' in position 59: character maps to <undefined>

Edit: Here is a sample of my source code:

import json
import mysql.connector
import unicodedata
from MySQLCL import MySQLCL

class Functions(object):
    """This is a class for Python functions"""

    @staticmethod
    def Clean(string):
        temp = str(string)
        temp = temp.replace("'", "").replace("(", "").replace(")", "").replace(",", "").strip()
        return temp

    @staticmethod
    def ParseTweet(string):
        for x in range(0, len(string)):
            tweetid = string[x]["id_str"]
            tweetcreated = string[x]["created_at"]
            tweettext = string[x]["text"]
            tweetsource = string[x]["source"]
            truncated = string[x]["truncated"]
            inreplytostatusid = string[x]["in_reply_to_status_id"]
            inreplytouserid = string[x]["in_reply_to_user_id"]
            inreplytoscreenname = string[x]["in_reply_to_screen_name"]
            geo = string[x]["geo"]
            coordinates = string[x]["coordinates"]
            place = string[x]["place"]
            contributors = string[x]["contributors"]
            isquotestatus = string[x]["is_quote_status"]
            retweetcount = string[x]["retweet_count"]
            favoritecount = string[x]["favorite_count"]
            favorited = string[x]["favorited"]
            retweeted = string[x]["retweeted"]
            possiblysensitive = string[x]["possibly_sensitive"]
            language = string[x]["lang"]

            print(Functions.UnicodeFilter(tweettext))
            #print("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + Functions.UnicodeFilter(tweettext) + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + str(language) + "', '" + Functions.ToSQL(tweetcreated) + "', '" + Functions.ToSQL(tweetsource) + "', " + str(possiblysensitive) + ")")
            #MySQLCL.Set("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + tweettext + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + language + "', '" + tweetcreated + "', '" + str(tweetsource) + "', " + str(possiblysensitive) + ")")

    @staticmethod
    def ToBool(variable):
        if variable.lower() == 'true':
            return True
        elif variable.lower() == 'false':
            return False

    @staticmethod
    def CheckNull(var):
        if var == None:
            return ""
        else:
            return var

    @staticmethod
    def ToSQL(var):
        temp = var
        temp = temp.replace("'", "''")
        return str(temp)

    @staticmethod
    def UnicodeFilter(var):
        temp = var
        #temp = temp.replace(chr(2019), "'")
        unicodestr = unicode(temp, 'utf-8')
        if unicodestr != temp:
            temp = "'"
        temp = Functions.ToSQL(temp)
        return temp

ekhumoro’s response was correct.

3 Answers


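  1. This is Python 2 code; the unicode() built-in does not exist in Python 3, where every str is already Unicode:

    unicode_string = unicode(some_string, 'utf-8')
    if unicode_string != some_string:
        some_string = 'whatever you want it to be'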
    
  2. You can encode your Unicode string to convert it to a byte string (type str in Python 2, bytes in Python 3):

    a=u"dataàçççñññ"
    type(a)
    a.encode('ascii','ignore')
    

    This drops the characters that ASCII cannot represent and returns 'data' (b'data' in Python 3).

    Alternatively, you can use the unicodedata module.

  3. There seem to be two problems with your program.

    Firstly, you are passing the wrong code point to chr(). The hexadecimal code point of the character is 0x2019, but you are passing in the decimal number 2019 (which equates to 0x7e3 in hexadecimal). So you need to do either:

        temp = temp.replace(chr(0x2019), "'") # hexadecimal
    

    or:

        temp = temp.replace(chr(8217), "'") # decimal
    

    in order to replace the character correctly.

    Secondly, the reason you are getting the error is that some other part of your program (probably the database backend) is trying to encode Unicode strings using an encoding other than UTF-8, one that has no mapping for this character. It’s hard to be more precise about this, because you did not include the full traceback in your question. However, the reference to “charmap” suggests a Windows code page is being used (but not cp1252); or an ISO encoding (but not ISO 8859-1, aka latin1); or possibly KOI8-R.

    Anyway, the correct way to deal with this issue is to ensure all parts of your program (and especially the database) use UTF-8. If you do that, you won’t have to mess about replacing characters anymore.
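
A minimal sketch of the first point in the third answer, checking which integer chr() actually needs for the right single quotation mark. The names RIGHT_QUOTE and replace_right_quote are illustrative only and do not appear in the original code.

```python
# chr() takes the code point as an integer: 0x2019 in hex is 8217 in decimal.
RIGHT_QUOTE = "\u2019"

# Both spellings of the same number give the right single quotation mark.
assert chr(0x2019) == chr(8217) == RIGHT_QUOTE

# Decimal 2019 is 0x7e3, which is a different character entirely.
assert chr(2019) == "\u07e3"


def replace_right_quote(text):
    """Replace U+2019 with a plain ASCII apostrophe."""
    return text.replace(RIGHT_QUOTE, "'")


print(replace_right_quote("don\u2019t"))  # don't
```

Writing the character as the escape "\u2019" sidesteps the hex/decimal mix-up, since the escape is always read as hexadecimal.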

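The second point, along with the unicodedata suggestion from the second answer, can be explored the same way. This sketch assumes cp437 as a stand-in for whichever code page is really in use (the question's traceback does not say), and can_encode and to_ascii are hypothetical helpers, not part of the original program.

```python
import unicodedata


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def to_ascii(text):
    """Lossy fallback: strip accents via NFKD, then drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


tweet = "don\u2019t"

print(can_encode(tweet, "cp437"))   # False: this code page has no U+2019
print(can_encode(tweet, "utf-8"))   # True: UTF-8 can represent U+2019
print(tweet.encode("utf-8"))        # b'don\xe2\x80\x99t'

# Only if the target really must be ASCII (this loses information):
print(to_ascii("data\u00e0\u00e7\u00f1"))  # dataacn
print(to_ascii(tweet))                      # dont
```

Note that the ASCII fallback silently drops the quotation mark rather than converting it, so keeping everything in UTF-8, as the third answer recommends, preserves the tweet text exactly.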