I’m a beginner at Python and I’m trying to gather data from Twitter using the API. I want to collect the username, the date, and the cleaned tweet text (without @username mentions, hashtags, and links) and then put it all into a DataFrame.
I found a way to do the cleaning using: ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
but when I use it in my code, it raises NameError: name 'tweet' is not defined
Here is my code:
tweets = tw.Cursor(api.search, q=keyword, lang="id", since=date).items()
raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
data_tweet = [[tweet.user.screen_name, tweet.created_at, raw_tweet] for tweet in tweets]
dataFrame = pd.DataFrame(data=data_tweet, columns=['user', "date", "tweet"])
I know the problem is in data_tweet, but I don’t know how to fix it. Please help me.
Thank you.
2 Answers
The problem is actually in the second line, the one that assigns raw_tweet.
There, you are using tweet.text. However, you have not defined what tweet is yet, only tweets. Also, from reading your third line, where tweet is actually defined as the loop variable of the list comprehension (for tweet in tweets),
I’m assuming you want tweet to be each value you get while iterating through tweets.
So what you have to do is clean the text inside that same loop, so that both lines run together for every tweet, assuming my earlier hypothesis is correct.
So:
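A minimal sketch of that fix, under a few assumptions: clean_tweet is a helper name introduced here for illustration, and the SimpleNamespace objects are hypothetical stand-ins for the statuses that tw.Cursor(...).items() would yield, so the example runs without API credentials. The pattern is the one from the question, written as a raw string.

```python
import re
from datetime import datetime
from types import SimpleNamespace

# Pattern from the question, as a raw string so the backslashes
# reach the regex engine unchanged.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")


def clean_tweet(text):
    """Replace mentions, links and other non-alphanumeric characters
    with spaces, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", text).split())


# Hypothetical stand-ins for the objects returned by the Twitter API.
tweets = [
    SimpleNamespace(
        user=SimpleNamespace(screen_name="alice"),
        created_at=datetime(2021, 1, 1, 12, 0),
        text="@bob Halo dunia! #python https://t.co/abc123",
    ),
    SimpleNamespace(
        user=SimpleNamespace(screen_name="dave"),
        created_at=datetime(2021, 1, 2, 8, 30),
        text="RT @carol: cek ini http://example.com/a?b=1",
    ),
]

# tweet only exists inside the comprehension, so the cleaning
# has to happen here rather than on a separate line above it.
data_tweet = [
    [tweet.user.screen_name, tweet.created_at, clean_tweet(tweet.text)]
    for tweet in tweets
]
```

The resulting data_tweet list can then be passed to pd.DataFrame(data=data_tweet, columns=['user', 'date', 'tweet']) exactly as in the question. Note that this pattern removes the '#' character but keeps the hashtag word itself.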
You can also use a regex to remove any words that start with ‘@’ (usernames) or ‘http’ (links) in a pre-defined function, and then apply that function to the pandas DataFrame column.
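A rough sketch of such a function, not necessarily the exact code from the credited page: the name remove_usernames_links and the sample strings are assumptions made for illustration. It is shown on a plain list so it runs without pandas.

```python
import re


def remove_usernames_links(tweet):
    # Drop each '@' together with the non-whitespace characters after it.
    tweet = re.sub(r"@\S+", "", tweet)
    # Drop each 'http' together with the non-whitespace characters after it.
    tweet = re.sub(r"http\S+", "", tweet)
    # Collapse the whitespace left behind by the removals.
    return " ".join(tweet.split())


# Hypothetical sample data standing in for the DataFrame column.
sample = ["@alice lihat ini https://t.co/xyz bagus", "tanpa mention"]
cleaned = [remove_usernames_links(text) for text in sample]

# Assuming a DataFrame df with a "tweet" column, as in the question,
# the same function could be applied with:
# df["tweet"] = df["tweet"].apply(remove_usernames_links)
```

Unlike the pattern in the question, this version leaves hashtags and punctuation untouched.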
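If you encounter an "expected string or bytes-like object" error, the column contains values that are not strings (for example NaN), and they need to be converted to strings before the function is applied.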
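One possible workaround, sketched under the assumption that the error comes from non-string values reaching re.sub: cast each value to str before running the regex. The function name and the sample values are hypothetical.

```python
import re


def remove_usernames_links_safe(value):
    # re.sub raises TypeError on non-strings (NaN, None, numbers),
    # so convert the value to text first.
    text = str(value)
    text = re.sub(r"@\S+", "", text)
    text = re.sub(r"http\S+", "", text)
    return " ".join(text.split())


# Hypothetical mixed-type values such as a scraped column might contain.
values = ["@bob halo http://example.com", None, 42]
cleaned = [remove_usernames_links_safe(v) for v in values]

# Assuming a DataFrame df with a "tweet" column, a similar cast could be:
# df["tweet"] = df["tweet"].astype(str).apply(remove_usernames_links_safe)
```

Casting turns missing values into literal text such as "None" or "nan", so it may be preferable to drop or fill those rows first, depending on the data.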
Credit: https://www.datasnips.com/59/remove-usernames-http-links-from-tweet-data/