
I know similar questions have been asked about this, but the project I'm working on uses Tweepy for Python, so it's a little more specific.

I'm collecting a thousand user IDs from the followers of Coke and Pepsi each, then searching through the most recent 20 statuses of each user to collect the hashtags they used.

I'm using the Tweepy followers_ids and user_timeline APIs, but I keep getting 401s from Twitter. If I set the number of user IDs to search to only 10 instead of 1000, I sometimes get the results I want, but even then I sometimes get 401s too. So it works, kind of. It seems to be the large set that's causing these errors, and I don't know how to work around them.

I know Twitter has limits on calls, but if I'm able to grab 1000 user IDs almost instantaneously, why can't I grab the statuses? I realize I'm trying to get 20,000 statuses, but I've tried this with only 100*20 and even 50*20 and still get 401s. I've reset my system clock multiple times, but that only works occasionally with the 10*20 set. I'm hoping somebody out there might have a better, more efficient way to do this than what I have so far. I'm brand new to the Twitter API and fairly new to Python, so hopefully it's just me.

Here’s the code:

import tweepy
import pandas as pd

consumer_key = 'REDACTED'
consumer_secret = 'REDACTED'
access_token = 'REDACTED'
access_token_secret = 'REDACTED'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

pepsiUsers = []
cokeUsers = []
cur_pepsiUsers = tweepy.Cursor(api.followers_ids, screen_name='pepsi')
cur_cokeUsers = tweepy.Cursor(api.followers_ids, screen_name='CocaCola')

for user in cur_pepsiUsers.items(1000):
    pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(pepsiUsers) - 1
        if hashtags:  # record every hashtag in this tweet
            for ht in hashtags:
                pepsiUsers[index]['hTags'].append(ht['text'])

for user in cur_cokeUsers.items(1000):
    cokeUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Coke' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(cokeUsers) - 1
        if hashtags:  # record every hashtag in this tweet
            for ht in hashtags:
                cokeUsers[index]['hTags'].append(ht['text'])

"""create a master list of coke and pepsi users to write to CSV"""
mergedList = cokeUsers + pepsiUsers
"""here we'll turn empty hashtag lists into blanks and turn all hashtags for each user into a single string
    for easier searching with R later"""
for i in mergedList:
    if len(i['hTags']) == 0:
        i['hTags'] = ''
    i['hTags'] = ''.join(i['hTags'])

list_df = pd.DataFrame(mergedList, columns=['userId', 'favSoda', 'hTags'])
list_df.to_csv('test.csv', index=False)

And here's the error I get when I run the for blocks that call api.user_timeline:

---------------------------------------------------------------------------
TweepError                                Traceback (most recent call last)
<ipython-input-134-a7658ed899f3> in <module>()
      3 for user in cur_pepsiUsers.items(1000):
      4     pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
----> 5     for status in tweepy.Cursor(api.user_timeline, user).items(20):
      6         status = status._json
      7         hashtags = status['entities']['hashtags']

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in __next__(self)
     47 
     48     def __next__(self):
---> 49         return self.next()
     50 
     51     def next(self):

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
    195         if self.current_page is None or self.page_index == len(self.current_page) - 1:
    196             # Reached end of current page, get the next page...
--> 197             self.current_page = self.page_iterator.next()
    198             self.page_index = -1
    199         self.page_index += 1

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
    106 
    107         if self.index >= len(self.results) - 1:
--> 108             data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kargs)
    109 
    110             if hasattr(self.method, '__self__'):

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in _call(*args, **kwargs)
    243             return method
    244         else:
--> 245             return method.execute()
    246 
    247     # Set pagination mode

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in execute(self)
    227                     raise RateLimitError(error_msg, resp)
    228                 else:
--> 229                     raise TweepError(error_msg, resp, api_code=api_error_code)
    230 
    231             # Parse the response payload

TweepError: Twitter error response: status code = 401

Answers


  1. Do you only need the Twitter JSON? Given the scale of what you're collecting, you may want to try twarc: https://github.com/edsu/twarc

  2. Try telling the client to wait on rate limits when creating the API:

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True,
                     retry_count=5, retry_delay=15)
    

    If this does not completely fix the problem, use a try/except block in Python to capture the error and wait for some time, say 15 minutes, before retrying.
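    That try/except-and-wait idea can be sketched as a small retry wrapper. The helper name call_with_backoff is made up for illustration; in real code you would catch tweepy.RateLimitError (and perhaps tweepy.TweepError) rather than bare Exception, so that genuine bugs still surface immediately:

```python
import time

def call_with_backoff(func, *args, retries=4, wait_seconds=15 * 60, **kwargs):
    """Call func(*args, **kwargs), sleeping and retrying when it raises.

    Catching bare Exception keeps this sketch self-contained; with Tweepy
    you would catch tweepy.RateLimitError / tweepy.TweepError instead.
    """
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(wait_seconds)

# With Tweepy it might be used like:
#   statuses = call_with_backoff(api.user_timeline, user, count=20)
```

    Wrapping each user_timeline call this way means a burst of 401s or rate-limit errors pauses the collection instead of killing the whole run.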
