I know similar questions have been asked about this, but the project I’m working on uses Tweepy for Python, so it’s a little more specific.
I’m collecting a thousand user IDs each from the followers of Coke and Pepsi, then searching through the most recent 20 statuses of each user to collect the hashtags they used.
I’m using the Tweepy followers_ids and user_timeline APIs, but I keep getting 401s from Twitter. If I set the number of user IDs to search to only 10 instead of 1000, I sometimes get the results I want, but even then I sometimes get 401s too. So it works... kind of. It seems to be the large set that’s causing these errors, and I don’t know how to step around them.
I know Twitter has limits on calls, but if I’m able to grab 1000 user IDs almost instantaneously, why can’t I grab the statuses? I realize I’m trying to get 20,000 statuses, but I’ve tried this with only 100*20 and even 50*20 and still get 401s. I’ve reset my system clock multiple times, but that only works occasionally with the 10*20 set. I’m hoping somebody out there has a better, more efficient way to do this than what I have so far. I’m brand new to the Twitter API and fairly new to Python, so hopefully it’s just me.
Here’s the code:
import tweepy
import pandas as pd
consumer_key = 'REDACTED'
consumer_secret = 'REDACTED'
access_token = 'REDACTED'
access_token_secret = 'REDACTED'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
pepsiUsers = []
cokeUsers = []
cur_pepsiUsers = tweepy.Cursor(api.followers_ids, screen_name='pepsi')
cur_cokeUsers = tweepy.Cursor(api.followers_ids, screen_name='CocaCola')
for user in cur_pepsiUsers.items(1000):
    pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(pepsiUsers) - 1
        if len(hashtags) > 1:
            for ht in hashtags:
                pepsiUsers[index]['hTags'].append(ht['text'])
for user in cur_cokeUsers.items(1000):
    cokeUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Coke' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(cokeUsers) - 1
        if len(hashtags) > 1:
            for ht in hashtags:
                cokeUsers[index]['hTags'].append(ht['text'])
"""create a master list of coke and pepsi users to write to CSV"""
mergedList = cokeUsers + pepsiUsers
"""here we'll turn empty hashtag lists into blanks and turn all hashtags for each user into a single string
for easier searching with R later"""
for i in mergedList:
    if len(i['hTags']) == 0:
        i['hTags'] = ''
    i['hTags'] = ''.join(i['hTags'])
list_df = pd.DataFrame(mergedList, columns=['userId', 'favSoda', 'hTags'])
list_df.to_csv('test.csv', index=False)
And here’s the error I get when I run the for loops that call api.user_timeline:
---------------------------------------------------------------------------
TweepError Traceback (most recent call last)
<ipython-input-134-a7658ed899f3> in <module>()
3 for user in cur_pepsiUsers.items(1000):
4 pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
----> 5 for status in tweepy.Cursor(api.user_timeline, user).items(20):
6 status = status._json
7 hashtags = status['entities']['hashtags']
/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in __next__(self)
47
48 def __next__(self):
---> 49 return self.next()
50
51 def next(self):
/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
195 if self.current_page is None or self.page_index == len(self.current_page) - 1:
196 # Reached end of current page, get the next page...
--> 197 self.current_page = self.page_iterator.next()
198 self.page_index = -1
199 self.page_index += 1
/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
106
107 if self.index >= len(self.results) - 1:
--> 108 data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kargs)
109
110 if hasattr(self.method, '__self__'):
/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in _call(*args, **kwargs)
243 return method
244 else:
--> 245 return method.execute()
246
247 # Set pagination mode
/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in execute(self)
227 raise RateLimitError(error_msg, resp)
228 else:
--> 229 raise TweepError(error_msg, resp, api_code=api_error_code)
230
231 # Parse the response payload
TweepError: Twitter error response: status code = 401
2 Answers
Do you only need the Twitter JSON? Given how much you are trying to collect, you may want to try twarc: https://github.com/edsu/twarc
Try adding rate-limit handling when you create the API object.
If this does not completely fix the problem, use try/except in Python to catch the error and wait for some time, such as 15 minutes, before retrying.
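The catch-and-wait suggestion above might look something like the sketch below. It is not taken from the question's code: call_with_retry, collect_hashtags, and the fetch_statuses argument are hypothetical names chosen for illustration, and the fetching function is passed in so the retry logic does not depend on any particular Tweepy version.

```python
import time


def call_with_retry(fetch, wait_seconds=15 * 60, max_attempts=3,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(); if it raises one of retry_on, wait and try again.

    Returns the result of fetch(), or None if every attempt failed.
    All names here are hypothetical helpers, not part of Tweepy.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on as err:
            print("attempt %d failed: %s" % (attempt, err))
            # Only wait if another attempt is still allowed.
            if attempt < max_attempts:
                sleep(wait_seconds)
    return None


def collect_hashtags(user_ids, fetch_statuses, **retry_options):
    """Map each user ID to the hashtag texts found in that user's statuses.

    fetch_statuses(user_id) is assumed to return a list of status dicts
    shaped like Twitter's JSON (status['entities']['hashtags']).
    Users whose statuses could not be fetched get an empty list, so one
    failing account does not abort the whole run.
    """
    results = {}
    for user_id in user_ids:
        statuses = call_with_retry(lambda: fetch_statuses(user_id),
                                   **retry_options)
        results[user_id] = [
            tag['text']
            for status in (statuses or [])
            for tag in status['entities']['hashtags']
        ]
    return results
```

With the Tweepy version shown in the traceback, fetch_statuses could plausibly be a small function that returns [s._json for s in tweepy.Cursor(api.user_timeline, user_id=uid).items(20)], with retry_on=(tweepy.TweepError,). For the rate-limit part, tweepy.API accepts wait_on_rate_limit=True, which makes Tweepy sleep until the limit window resets. Note that a 401 is an authorization error rather than a rate-limit response, so for those it may be more practical to pass max_attempts=1 and simply skip the user instead of waiting 15 minutes.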