I am trying to do some text mining using Twitter data. I do the following:

#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

#set radius and amount of requests
N=200  # tweets to request from each query
S=200  # radius in miles

lats=c(38.9,40.7)
lons=c(-77,-74)

roger=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Roger+Federer',
                                                                  lang="en",n=N,resultType="recent",
                                                                  geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))

This all works fine, but when I try to apply tolower through tm_map from the tm package like this:

data=as.data.frame(cbind(tweet=rogertext))
corpus=Corpus(VectorSource(data$tweet))
corpus=tm_map(corpus,tolower)

It throws this error:

> corpus=tm_map(corpus,tolower)
Error in FUN(X[[i]], ...) : 
invalid input 'RT @Federerism: Roger Federer reaches  5 million followers   on twitter  Love You Roger í ½í¸˜ í ½í¸ í ½í¸˜ í ½í¸ #Roger #Federer #   Federerism #Maestro https:/…' in 'utf8towcs'

Any thoughts on what is going wrong?

2 Answers


  1. Try the following:

    corpus <- tm_map(corpus, content_transformer(tolower))
    

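    The syntax of the tm package changed a few years ago: functions that are not tm transformations, such as tolower, now have to be wrapped in content_transformer before being passed to tm_map. Hope this solves the problem.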

  2. base::tolower chokes on input it cannot decode, such as the emoji bytes in this tweet. This is often a problem when mining tweets. You could try catching the errors, or just use stringi’s counterpart to tolower, stri_trans_tolower:

    # tw <- searchTwitter('Roger Federer reaches  5 million followers   on twitter  Love You Roger', n=1) 
    download.file("https://www.dropbox.com/s/33ilhcu2v82nwuq/twitter_tolower.rda?dl=1", tf <- tempfile(fileext = ".rda"), mode="wb")
    load(tf) 
    
    tw[[1]]$getText()
    # [1] "RT @Federerism: Roger Federer reaches  5 million followers on twitter  Love You Roger \xed��\xed�\u0098 \xed��\xed�\u008d \xed��\xed�\u0098 \xed��\xed�\u008d #Roger #Federer # Federerism #Maestro https:/…"
    
    ## Does not work:
    tolower(tw[[1]]$getText())
    # Error in tolower(tw[[1]]$getText()) : 
    #   invalid input 'RT @Federerism: Roger Federer reaches  5 million followers on twitter  Love You Roger í ½í¸˜ í ½í¸ í ½í¸˜ í ½í¸ #Roger #Federer # Federerism #Maestro https:/…' in 'utf8towcs'
    
    ## Works:
    stringi::stri_trans_tolower(tw[[1]]$getText())
    # [1] "rt @federerism: roger federer reaches  5 million followers on twitter  love you roger \xed��\xed�\u0098 \xed��\xed�\u008d \xed��\xed�\u0098 \xed��\xed�\u008d #roger #federer # federerism #maestro https:/…"
    
    ## Works, too:
    library(tm)
    corp <- Corpus(VectorSource(tw[[1]]$getText()))
    corp <- tm_map(corp, content_transformer(stringi::stri_trans_tolower))
    content(corp[[1]])
    # [1] "rt @federerism: roger federer reaches  5 million followers on twitter  love you roger \xed��\xed�\u0098 \xed��\xed�\u008d \xed��\xed�\u0098 \xed��\xed�\u008d #roger #federer # federerism #maestro https:/…"
    
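    For the "catching errors" route mentioned above, here is a minimal sketch using only base R. The helper name safe_tolower is hypothetical (it is not part of tm, twitteR or stringi), and the fallback is an assumption on my part: when tolower fails on an element, that element is stripped down to plain ASCII with iconv before retrying, so emoji and accented characters are lost for those tweets only.

    ```r
    # Hypothetical helper: lowercase each string on its own, so that one bad
    # tweet does not abort the whole vector.
    safe_tolower <- function(x) {
      vapply(x, function(s) {
        tryCatch(
          tolower(s),
          error = function(e) {
            # Assumed fallback: drop every byte that cannot be converted to
            # ASCII (this removes the undecodable emoji bytes), then retry.
            tolower(iconv(s, from = "UTF-8", to = "ASCII", sub = ""))
          }
        )
      }, character(1), USE.NAMES = FALSE)
    }

    # Made-up test string containing bytes that are not valid UTF-8.
    bad <- "Love You Roger \xed\xa0\xbd #Maestro"
    res <- safe_tolower(c("Roger FEDERER", bad))
    ```

    If this behaves as intended, it should also be usable inside the corpus pipeline as tm_map(corp, content_transformer(safe_tolower)), although stri_trans_tolower is the simpler choice when stringi is available.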