I was trying to understand what topics some celebrities tweet about. I set up a Twitter API connection and pulled the tweets of a few personalities from their verified handles.
I processed the tweets as follows:
- Replaced non-graphic characters with a blank
AmitText = str_replace_all(tweets.df$text, "[^[:graph:]]", " ")
- Converted all characters to lower case
- Removed punctuation, hyperlinks, tabs, the keyword "rt", and blank spaces at the beginning and end of tweets
- Created a corpus, removed stopwords, and created a wordcloud
AmitText.corpus <- Corpus(VectorSource(AmitText))
AmitText.corpus <- tm_map(AmitText.corpus, removeWords, stopwords("en"))
wordcloud(AmitText.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"), random.color = FALSE, random.order = FALSE, max.words = 150)
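The lower-casing and removal steps listed above are described but not shown as code; a minimal sketch of how they might look (the sample tweet is hypothetical, and the exact patterns are assumptions, not the asker's original code):

```r
# Hypothetical sample tweet, for illustration only
AmitText <- "RT Check this: http://t.co/abc !!"

AmitText <- tolower(AmitText)                  # convert to lower case
AmitText <- gsub("http\\S+", " ", AmitText)    # remove hyperlinks (before punctuation!)
AmitText <- gsub("[[:punct:]]", " ", AmitText) # remove punctuation
AmitText <- gsub("\\brt\\b", " ", AmitText)    # remove the standalone keyword "rt"
AmitText <- gsub("[\t ]+", " ", AmitText)      # collapse tabs and repeated spaces
AmitText <- trimws(AmitText)                   # trim leading/trailing blanks
```

Note the ordering: links must be stripped before punctuation, otherwise `http://t.co/abc` degrades into stray tokens like `http` and `tco` that pollute the wordcloud.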
This creates a decent wordcloud, but the problem is that I get a big 'fffd' in the middle of it, suggesting that this is the word the celebrity tweeted most. In fact, I see the same pattern for all 7 celebrities. Since I was sure this could not be the case, I checked their raw tweets too, and found no such word as 'fffd' anywhere. From what I understand, this is some graphic character that isn't being read correctly. I am not sure of the reason, and Google isn't much help.
2 Answers
Let's try this at the beginning of your data pre-processing.
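The snippet this answer refers to is not shown in the thread. A commonly suggested fix for stray `\uFFFD` characters in this situation (an assumption on my part, not necessarily the answerer's verbatim code) is to re-encode the text before any other processing, dropping bytes that cannot be represented; note the second answer below cautions that discarding characters this way loses information:

```r
# Hypothetical sample data standing in for the real tweets.df
tweets.df <- data.frame(text = "Caf\u00e9 \u20B9100", stringsAsFactors = FALSE)

# Re-encode to ASCII; sub = "" silently drops non-convertible characters
# ("é", "₹") instead of letting them surface later as \uFFFD
tweets.df$text <- iconv(tweets.df$text, from = "UTF-8", to = "ASCII", sub = "")
```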
Hope this helps!
Don’t forget to let us know if it solved your problem 🙂
They are not junk characters. They are designed to tell you and your users that, somewhere, data was lost due to mishandling of their text.
There is a big difference between "Please pay �1000" and "Please pay 1000" when the original is "Please pay ₹1000" (or was it "Please pay ₿1000"?). Removing � is not an ideal solution.
Somewhere some program read a text file or stream using a character encoding other than the one it was written or sent with. Simple as that. Hopefully, you can fix it upstream.