I want to schedule my Python script to run every hour and save the data in an Elasticsearch index. For that I wrote a function, set_interval; the script itself uses the tweepy library. But it doesn't work the way I need: it runs every minute and saves the data to the index. Even after I set the interval to 3600 seconds, it still runs every minute. I want it to run on an hourly basis.
How can I fix this? Here's my Python script:
from datetime import datetime
from threading import Thread, Timer
import tweepy
from elasticsearch import helpers

def call_at_interval(time, callback, args):
    while True:
        timer = Timer(time, callback, args=args)
        timer.start()
        timer.join()

def set_interval(time, callback, *args):
    Thread(target=call_at_interval, args=(time, callback, args)).start()
def get_all_tweets(screen_name):
    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    screen_name = ""
    # initialize a list to hold all the tweepy Tweets
    alltweets = []
    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    # save most recent tweets
    alltweets.extend(new_tweets)
    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        # print("getting tweets before %s" % (oldest))
        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        # save most recent tweets
        alltweets.extend(new_tweets)
        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        # print("...%s tweets downloaded so far" % (len(alltweets)))
    outtweets = [{'ID': tweet.id_str, 'Text': tweet.text, 'Date': tweet.created_at, 'author': tweet.user.screen_name} for tweet in alltweets]

    def save_es(outtweets, es):  # PEP 8 convention
        data = [  # Please without s in data
            {
                "_index": "index name",
                "_type": "type name",
                "_id": index,
                "_source": ID
            }
            for index, ID in enumerate(outtweets)
        ]
        helpers.bulk(es, data)

    save_es(outtweets, es)
    print('Run at:')
    print(datetime.now())
    print("\n")
    set_interval(3600, get_all_tweets(screen_name))
2 Answers
Get rid of all the timer code and just write the logic; cron will do the job for you. Run
crontab -e
and add an entry to the end of the file that starts with
0 * * * *
followed by the command that runs your script. That schedule means "run at minute zero of every hour".
I also noticed that you are recursively calling
get_all_tweets(screen_name)
inside the function itself. I think you have to call it from outside. Keep only the fetch-and-save logic in your script.
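A rough sketch of how the trimmed-down script might be laid out once cron handles the scheduling. The body of get_all_tweets is reduced to a placeholder here (the real version would keep the tweepy and Elasticsearch calls from the question), and main and the screen name "some_account" are hypothetical names used only for illustration:

```python
from datetime import datetime


def get_all_tweets(screen_name):
    # Placeholder for the question's fetch-and-index logic (tweepy requests,
    # helpers.bulk, ...), with the timer code and the recursive call removed.
    return []


def main(screen_name):
    # One run per invocation; cron starts the script again every hour.
    tweets = get_all_tweets(screen_name)
    print("Run at:", datetime.now())
    return len(tweets)


if __name__ == "__main__":
    main("some_account")  # hypothetical screen name
```

The function is called once, from outside its own body, so a single invocation does a single fetch and then exits.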
Why do you need so much complexity to do a task every hour? You can run the script every hour with a plain loop that does the work and then sleeps for 3600 seconds. Note that each cycle then takes 1 hour plus the time the work itself takes.
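A minimal sketch of such a loop, assuming the task is wrapped in a function. The names run_every and do_work and the max_runs parameter are hypothetical; max_runs exists only so the loop can be stopped when trying it out:

```python
import time


def run_every(interval_seconds, job, max_runs=None):
    """Call job(), sleep for interval_seconds, and repeat.

    max_runs is a hypothetical knob for testing; None means loop forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()  # the next cycle cannot start until this returns
        runs += 1
        time.sleep(interval_seconds)  # so each cycle lasts interval + job time
    return runs


def do_work():
    # Stand-in for the real task (fetch tweets, index them, ...).
    print("Do some work")


# Hourly, forever:
# run_every(3600, do_work)
```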
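If you want the script to start exactly every hour, start the work in a separate thread (thr) and let the main loop sleep for 3600 seconds between starts, so the time the work takes no longer delays the next run.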
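One way this could look, as a sketch rather than the answer's original code. The names start_every and max_runs are hypothetical; only the thread variable thr follows the answer's wording:

```python
import threading
import time


def start_every(interval_seconds, job, max_runs=None):
    """Start job() in a new thread every interval_seconds.

    max_runs is a hypothetical knob for testing; None means loop forever.
    """
    started = 0
    while max_runs is None or started < max_runs:
        thr = threading.Thread(target=job)
        thr.start()  # returns immediately; the job runs in the background
        started += 1
        time.sleep(interval_seconds)  # the job's duration no longer adds to the cycle
    return started


# Hourly, forever (do_work being whatever function performs the task):
# start_every(3600, do_work)
```

time.sleep can overshoot slightly, so the start times may still drift by small amounts over a long run.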
In this case thr is supposed to finish its work in less than 3600 seconds. If it does not, you'll still get results, but they will come from a different attempt.
The result you'll get in that case is:
Do some work 1
Do some work 1
Do some work 1
Do some work 1
Some work is done! 1
Do some work 2
Some work is done! 2
Do some work 3
Some work is done! 3
Do some work 4
Some work is done! 4
Do some work 5
Some work is done! 5
Do some work 6
Some work is done! 6
Do some work 7
Some work is done! 7
Do some work 8
Some work is done! 8
Do some work 9
I like using subprocess.Popen for such tasks: if the child process has not finished its work within one hour for any reason, you just terminate it and start a new one.
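A sketch of that idea, assuming the task lives in its own script that can be started as a command. The names run_with_deadline and run_child_every, the max_runs parameter, and the script name in the final comment are all hypothetical:

```python
import subprocess
import sys
import time


def run_with_deadline(cmd, timeout_seconds):
    """Start cmd as a child process; terminate it if it outlives the deadline."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()  # child overran its slot, stop it
        proc.wait()       # reap the terminated child
    return proc.returncode


def run_child_every(cmd, interval_seconds, max_runs=None):
    """Run cmd once per interval; max_runs=None means loop forever."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        run_with_deadline(cmd, interval_seconds)
        runs += 1
        # Sleep out whatever is left of the interval before the next start.
        remaining = interval_seconds - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return runs


# Hourly, forever (hypothetical script name):
# run_child_every([sys.executable, "fetch_tweets.py"], 3600)
```

Because the child is a separate process, a hung or slow run cannot block the next one; it is simply terminated when its hour is up.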
You can also use cron to schedule a process to run every hour.