
I was trying to understand what topics some celebrities tweet about. I established a Twitter API connection and pulled the tweets of a few personalities from their verified handles.

I processed the tweets as follows:

  1. Replaced non-graphic characters with a space
    library(stringr)
    AmitText <- str_replace_all(tweets.df$text, "[^[:graph:]]", " ")
  2. Converted all characters to lower case
  3. Removed punctuation, hyperlinks, tabs, the keyword “rt”, and blank spaces at the beginning and end of tweets
  4. Created a corpus, removed stopwords and created a wordcloud
    library(tm)
    library(wordcloud)
    AmitText.corpus <- Corpus(VectorSource(AmitText))
    AmitText.corpus <- tm_map(AmitText.corpus, removeWords, stopwords("en"))
    wordcloud(AmitText.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"), random.color = FALSE, random.order = FALSE, max.words = 150)
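
The code for steps 2 and 3 is not shown above. A minimal sketch of what they might look like in base R follows; the function name `clean_tweets`, the sample tweet and the exact patterns are assumptions for illustration, not the code that was actually run.

```r
# Hypothetical helper (not from the original post): lower-cases a character
# vector of tweets and strips links, the "rt" marker, punctuation and
# surplus whitespace.
clean_tweets <- function(x) {
  x <- tolower(x)                                   # step 2: lower case
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)   # hyperlinks
  x <- gsub("\\brt\\b", " ", x, perl = TRUE)        # the keyword "rt" as a whole word
  x <- gsub("[[:punct:]]", " ", x)                  # punctuation
  x <- gsub("[[:space:]]+", " ", x)                 # tabs and runs of whitespace
  trimws(x)                                         # blanks at both ends
}

# Made-up sample tweet
clean_tweets("RT @user: Great match today!\thttps://t.co/abc123 ")
# returns "user great match today"
```

Hyperlinks are removed before punctuation on purpose: once the `:` and `/` characters are gone, a URL can no longer be recognised as one.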

This produces a decent wordcloud, but a big ‘fffd’ appears in the middle of it, which would suggest that this is the word the celebrity tweets most often. In fact, I see the same pattern for all 7 celebrities. I was sure this could not be right, so I checked their raw tweets and found no such word as fffd in them. From what I understand, this is some graphic character that isn’t being read correctly. I am not sure what the cause is, and Google hasn’t been much help.

2 Answers


  1. Try this at the beginning of your data pre-processing, assigning the result back so the cleaned text is actually used:

    tweet$text <- iconv(tweet$text, from="UTF-8", to="ASCII", sub="")
    

    Hope this helps!

    Don’t forget to let us know if it solved your problem 🙂

  2. They are not junk characters. They are designed to tell you and your users that somewhere data was lost due to mishandling of their text.

    There is a big difference between “Please pay �1000” and “Please pay 1000” when the original is “Please pay ₹1000” (or was it “Please pay ₿1000”?). Removing � is not an ideal solution.

    Somewhere some program read a text file or stream using a character encoding other than the one it was written or sent with. Simple as that. Hopefully, you can fix it upstream.
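
To make the trade-off between the two answers concrete, here is a minimal base R sketch. It assumes that the `fffd` token comes from U+FFFD, the Unicode replacement character, and that the text is UTF-8; the sample strings and variable names are made up for illustration. It first counts how many entries already contain U+FFFD, then applies the `iconv()` call from the first answer.

```r
# Made-up sample data. intToUtf8() builds the characters from their code
# points so the script itself stays pure ASCII.
rupee <- intToUtf8(0x20B9)   # the rupee sign
fffd  <- intToUtf8(0xFFFD)   # U+FFFD, the replacement character
tweets <- c(paste0("Please pay ", rupee, "1000"),
            paste0("Please pay ", fffd, "1000"),
            "plain ascii")

# Which entries already contain U+FFFD, i.e. were damaged before this point?
has_fffd <- grepl(fffd, tweets, fixed = TRUE)
sum(has_fffd)

# The call from the first answer: characters that cannot be represented
# in ASCII are not kept.
ascii_only <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# The intact amount and the damaged one can no longer be told apart.
ascii_only[1] == ascii_only[2]
```

If `sum(has_fffd)` is already non-zero on the raw download, the damage happened before any cleaning step, which points at how the tweets were fetched or read rather than at the wordcloud code.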
