I am very new to Python, so I'm looking for help with this problem. My goal is to collect roughly 10,000 tweets that contain images and save them into a CSV file. Since Twitter's rate limit is 450 requests per 15 minutes, I would ideally like to automate this process. The guides I've seen use the tweepy module, but since I didn't quite understand it, I've used the sample Python code given by Twitter:
import requests
import pandas as pd
import os
import json

# To set your environment variables in your terminal, run the following line:
os.environ['BEARER_TOKEN'] = ''


def auth():
    return os.environ.get("BEARER_TOKEN")


def create_url():
    query = "has:images lang:en -is:retweet"
    tweet_fields = "tweet.fields=attachments,created_at,author_id"
    expansions = "expansions=attachments.media_keys"
    media_fields = "media.fields=media_key,preview_image_url,type,url"
    max_results = "max_results=100"
    url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}&{}&{}&{}".format(
        query, tweet_fields, expansions, media_fields, max_results
    )
    return url


def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers):
    response = requests.request("GET", url, headers=headers)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def save_json(file_name, file_content):
    with open(file_name, 'w', encoding='utf-8') as write_file:
        json.dump(file_content, write_file, sort_keys=True, ensure_ascii=False, indent=4)


def main():
    bearer_token = auth()
    url = create_url()
    headers = create_headers(bearer_token)
    json_response = connect_to_endpoint(url, headers)
    # Save the data as a json file
    # save_json('collected_tweets.json', json_response)
    # Save tweets as csv
    # df = pd.json_normalize(data=json_response)
    df1 = pd.DataFrame(json_response['data'])
    df1.to_csv('tweets_data.csv', mode="a")
    df2 = pd.DataFrame(json_response['includes'])
    df2.to_csv('tweets_includes_media.csv', mode="a")
    print(json.dumps(json_response['meta'], sort_keys=True, indent=4))


if __name__ == "__main__":
    main()
How should I alter this code so that it loops within Twitter's v2 rate limits, or would it be better to use tweepy instead?
As a side note, I do realise my code for saving to CSV has issues, but this is the best I can do right now.
2 Answers
Try using a scheduler: you give it the amount of time between two events and the method to execute every 16 minutes (in this example). Consider the exact time and the exact method to repeat; a sketch is shown below.
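A minimal sketch of that idea using Python's standard sched module (the 16-minute interval and the call to the question's main() are assumptions for illustration, not part of the original answer):

import sched
import time

INTERVAL_SECONDS = 16 * 60  # wait 16 minutes between batches of requests
scheduler = sched.scheduler(time.time, time.sleep)

def collect_batch():
    main()  # the question's main(): one search request, results saved to CSV
    scheduler.enter(INTERVAL_SECONDS, 1, collect_batch)  # schedule the next run

scheduler.enter(0, 1, collect_batch)  # run the first batch immediately
scheduler.run()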
There are a couple of things to keep in mind here.
"has:images lang:en -is:retweet"
) and just gathers those tweets in real-time. If you are trying to get the full record of non-retweet, English tweets between two periods of time, you will need to add those points in time to your query and then manage the limits as you requested above. Check outstart_time
andend_time
in the API reference docs.connect_to_endpoint(url, headers)
function. Also, you can use another functionpause_until
, written for a Twitter V2 API package I am in the process of developing (link to function code).Since the full query result will be much larger than the 100 tweets you are requesting at each query, you need to keep track of your location in the larger query. This is done via a
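The linked pause_until code is not included in this post; as a rough stand-in, a helper that sleeps until a given moment might look like this (my own sketch, not the package's implementation):

import datetime
import time

def pause_until(until):
    # Sleep until the given moment; `until` may be a Unix timestamp (float)
    # or a datetime.datetime object.
    if isinstance(until, datetime.datetime):
        until = until.timestamp()
    remaining = until - time.time()
    if remaining > 0:
        time.sleep(remaining)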
Third, since the full query result will be much larger than the 100 tweets you are requesting at each query, you need to keep track of your location in the larger query. This is done via a next_token.
To get the next_token, it's actually quite easy: simply grab it from the meta field in the response. To be clear, you can use it in a request loop like the sketch below.
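For illustration, a loop built on the question's functions and the pause_until() stand-in above might look roughly like this (collected, request_count, and the stopping condition are my own assumptions):

import time

headers = create_headers(auth())
collected = []      # all tweet objects gathered so far
next_token = None   # pagination token; None means "request the first page"
request_count = 0

while len(collected) < 10000:
    url = create_url(next_token)                      # updated create_url(), sketched below
    json_response = connect_to_endpoint(url, headers)
    collected.extend(json_response["data"])
    next_token = json_response["meta"].get("next_token")
    if next_token is None:
        break                                         # no further pages for this query
    request_count += 1
    if request_count % 450 == 0:                      # stay inside 450 requests / 15 minutes
        pause_until(time.time() + 15 * 60)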
Then this token needs to be passed in the query details, which are contained in the URL you create with your create_url() function. That means you'll also need to update your create_url() function to something like the sketch below.
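Something along these lines (a sketch, assuming the recent-search endpoint's next_token query parameter as described in the API reference):

def create_url(next_token=None):
    query = "has:images lang:en -is:retweet"
    tweet_fields = "tweet.fields=attachments,created_at,author_id"
    expansions = "expansions=attachments.media_keys"
    media_fields = "media.fields=media_key,preview_image_url,type,url"
    max_results = "max_results=100"
    url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}&{}&{}&{}".format(
        query, tweet_fields, expansions, media_fields, max_results
    )
    if next_token is not None:
        url += "&next_token={}".format(next_token)  # request the next page of results
    return url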
After altering the above functions, your code should flow in the following manner:
1. Make a request.
2. Grab the next_token from response["meta"]["next_token"].
3. Update the query parameters to include next_token with create_url().
4. Repeat until you have the tweets you need or no next_token is returned.
Final note: I would not try to work with pandas dataframes to write your file. I would create an empty list, append the results from each new query to that list, and then write the final list of dictionary objects to a json file (see this question for details). I’ve learned the hard way that raw tweets and pandas dataframes don’t really play nice. Much better to get used to how json objects and dictionaries work.
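For example (a minimal sketch; all_results is a hypothetical name for the list built up inside the request loop):

import json

all_results = []  # inside the loop, do: all_results.extend(json_response["data"])

with open("collected_tweets.json", "w", encoding="utf-8") as outfile:
    json.dump(all_results, outfile, ensure_ascii=False, indent=4)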