
I’m a beginner at Python and I’m trying to gather data from Twitter using the API. I want to collect the username, the date, and the cleaned tweet text (without @username mentions, hashtags, and links) and then put them into a DataFrame.

I found a way to do the cleaning with ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split()), but when I use it in my code it raises NameError: name 'tweet' is not defined

Here is my code:

tweets = tw.Cursor(api.search, q=keyword, lang="id", since=date).items()

raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())

data_tweet = [[tweet.user.screen_name, tweet.created_at, raw_tweet] for tweet in tweets]

dataFrame = pd.DataFrame(data=data_tweet, columns=['user', "date", "tweet"])

I know the problem is in the data_tweet, but I don’t know how to fix it. Please help me

Thank you.

2 Answers


  1. The problem is actually in the second line:

    raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
    

    Here, you are using tweet.text. However, you have not defined what tweet is yet, only tweets. Also, from reading your third line where you actually define tweet:

    for tweet in tweets
    

    I’m assuming you want tweet to be each value you get while iterating through tweets.
    If so, the cleaning step has to run inside the same loop that iterates over tweets, so that tweet is already defined when the regex is applied:

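    data_tweet = []
    for tweet in tweets:
        # tweet is defined inside the loop, so tweet.text can be cleaned here
        raw_tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet.text).split())
        # append one row per tweet instead of overwriting the list each time
        data_tweet.append([tweet.user.screen_name, tweet.created_at, raw_tweet])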
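The same idea can be kept as a list comprehension by moving the regex into a small helper, so that it is called once per tweet. This is only a sketch: clean_tweet is a hypothetical helper name, and the SimpleNamespace objects are stand-ins for the results that tw.Cursor would return, so the snippet runs without API access.

```python
import re
from types import SimpleNamespace

# Same pattern as in the question: @mentions, any character that is not
# a letter, digit, space or tab, and scheme://... links.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

def clean_tweet(text):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", text).split())

# Stand-in objects (an assumption for illustration) that expose only the
# three attributes the question uses: user.screen_name, created_at, text.
tweets = [
    SimpleNamespace(user=SimpleNamespace(screen_name="alice"),
                    created_at="2021-01-01",
                    text="@bob Halo dunia! https://t.co/abc #python"),
    SimpleNamespace(user=SimpleNamespace(screen_name="carol"),
                    created_at="2021-01-02",
                    text="Selamat pagi"),
]

# tweet (here t) only exists inside the comprehension, so the cleaning
# call has to live inside it as well.
data_tweet = [[t.user.screen_name, t.created_at, clean_tweet(t.text)] for t in tweets]
```

With real data, data_tweet can then be passed to pd.DataFrame as in the question. Note that this pattern strips the '#' character but keeps the hashtag word itself.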
    
  2. You can also use a regex to remove any word that starts with ‘@’ (usernames) or ‘http’ (links) in a predefined function, then apply that function to the pandas DataFrame column:

    import re
    
    def remove_usernames_links(tweet):
        tweet = re.sub(r'@[^\s]+', '', tweet)
        tweet = re.sub(r'http[^\s]+', '', tweet)
        return tweet
    df['tweet'] = df['tweet'].apply(remove_usernames_links)
    

    If you encounter an "expected string or bytes-like object" error, convert each value to a string first:

    import re

    def remove_usernames_links(tweet):
        tweet = re.sub(r'@[^\s]+', '', str(tweet))
        tweet = re.sub(r'http[^\s]+', '', str(tweet))
        return tweet

    df['tweet'] = df['tweet'].apply(remove_usernames_links)
    

    Credit: https://www.datasnips.com/59/remove-usernames-http-links-from-tweet-data/
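To see what the str() variant does without needing pandas, here is a minimal sketch that runs it over a plain list. The sample strings are made up for illustration, and the final whitespace collapse is an addition that the original function does not have.

```python
import re

# One compiled pattern covering both cases: words starting with '@'
# and words starting with 'http'.
MENTION_OR_LINK = re.compile(r"@\S+|http\S+")

def remove_usernames_links(tweet):
    # str() avoids the "expected string or bytes-like object" error
    # when a value is not a string (for example None or a float NaN).
    text = MENTION_OR_LINK.sub("", str(tweet))
    # Extra step (not in the original answer): collapse leftover spaces.
    return " ".join(text.split())

samples = ["@alice lihat ini https://t.co/abc bagus", "tanpa mention", None]
cleaned = [remove_usernames_links(s) for s in samples]
```

One thing to keep in mind: str() turns a missing value into literal text such as "None" or "nan", so it may be better to drop or fill missing rows before applying the function.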
