I am trying to do some text mining using Twitter data. I do the following:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Roger+Federer',
lang="en",n=N,resultType="recent",
geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))
This all works fine, but when I try to apply the tolower function to the corpus with the tm package like this:
data=as.data.frame(cbind(tweet=rogertext))
corpus=Corpus(VectorSource(data$tweet))
corpus=tm_map(corpus,tolower)
It throws this error:
> corpus=tm_map(corpus,tolower)
Error in FUN(X[[i]], ...) :
invalid input 'RT @Federerism: Roger Federer reaches 5 million followers on twitter Love You Roger í ½í¸˜ í ½í¸ í ½í¸˜ í ½í¸ #Roger #Federer # Federerism #Maestro https:/…' in 'utf8towcs'
Any thoughts on what goes wrong?
2 Answers
Try the following:
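A minimal sketch, assuming the syntax change in question is that newer tm versions expect base functions such as tolower to be wrapped in content_transformer() before being passed to tm_map. The sample tweets are hypothetical stand-ins for the real data, and VCorpus is used here instead of Corpus so that each tweet is stored as its own text document:

```r
library(tm)

# Hypothetical sample tweets standing in for the real data
tweets <- c("RT @Federerism: Roger Federer", "Love You ROGER #Maestro")

# VCorpus keeps each tweet as its own text document
corpus <- VCorpus(VectorSource(tweets))

# Wrap the base function so tm_map gets back a text document,
# not a bare character vector
corpus <- tm_map(corpus, content_transformer(tolower))

# Collect the lowercased text of every document
lowered <- vapply(seq_len(length(corpus)),
                  function(i) content(corpus[[i]]),
                  character(1))
print(lowered)
```

Note that this only addresses the tm syntax; tweets containing invalid byte sequences may still make tolower fail.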
There was a change of syntax in the tm package a few years ago. Hope this solves the problem.

base::tolower chokes on special characters. This is often a problem when mining tweets. You could try catching errors or just use stringi’s tolower counterpart:
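A minimal sketch of both suggestions on a plain character vector. The sample tweets and the helper name safe_tolower are hypothetical, and the stringi function assumed to be meant here is stri_trans_tolower:

```r
library(stringi)

# Hypothetical sample tweets standing in for the real data
tweets <- c("RT @Federerism: Roger Federer", "Love You ROGER #Maestro")

# Option 1: catch the error from base::tolower and return NA
# for any input it rejects
safe_tolower <- function(x) {
  tryCatch(tolower(x), error = function(e) NA_character_)
}
lowered_base <- vapply(tweets, safe_tolower, character(1), USE.NAMES = FALSE)

# Option 2: use stringi's counterpart to tolower
lowered_stringi <- stri_trans_tolower(tweets)

print(lowered_base)
print(lowered_stringi)
```

With a tm corpus, the same function could presumably be passed as tm_map(corpus, content_transformer(stri_trans_tolower)), again assuming a tm version that provides content_transformer.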