
I am trying to stream data from Twitter to an AWS S3 bucket. The good news is that I can get the data to stream into my bucket, but it arrives in chunks of roughly 20 KB (I think this may be due to some Firehose setting) and it is not saved as JSON, even though I parse it with json.loads in my Python code. Rather than being saved as JSON, the data in my S3 bucket sits in files with no extension and long alphanumeric filenames. I think it may be something to do with the parameters used in client.put_record().

Any help is greatly appreciated!

Please find my code below, which I got from GitHub here.


from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
import boto3
import time


#Variables that contain the user credentials to access the Twitter API
consumer_key = "MY_CONSUMER_KEY"
consumer_secret = "MY_CONSUMER_SECRET"
access_token = "MY_ACCESS_TOKEN"
access_token_secret = "MY_SECRET_ACCESS_TOKEN"


#A basic listener that prints received tweets to stdout and forwards them to Firehose
class StdOutListener(StreamListener):

    def on_data(self, data):
        tweet = json.loads(data)
        try:
            if 'extended_tweet' in tweet:
                #Extended tweets carry the full text in a separate field
                message_lst = [str(tweet['id']),
                       str(tweet['user']['name']),
                       str(tweet['user']['screen_name']),
                       tweet['extended_tweet']['full_text'],
                       str(tweet['user']['followers_count']),
                       str(tweet['user']['location']),
                       str(tweet['geo']),
                       str(tweet['created_at']),
                       '\n'
                       ]
                #Join the fields into one tab-separated record
                message = '\t'.join(message_lst)
                print(message)
                client.put_record(
                    DeliveryStreamName=delivery_stream,
                    Record={
                        'Data': message
                    }
                )
            elif 'text' in tweet:
                message_lst = [str(tweet['id']),
                       str(tweet['user']['name']),
                       str(tweet['user']['screen_name']),
                       tweet['text'].replace('\n', ' ').replace('\r', ' '),
                       str(tweet['user']['followers_count']),
                       str(tweet['user']['location']),
                       str(tweet['geo']),
                       str(tweet['created_at']),
                       '\n'
                       ]
                message = '\t'.join(message_lst)
                print(message)
                client.put_record(
                    DeliveryStreamName=delivery_stream,
                    Record={
                        'Data': message
                    }
                )
        except Exception as e:
            print(e)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    listener = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    #tweets = Table('tweets_ft',connection=conn)
    client = boto3.client('firehose', 
                          region_name='us-east-1',
                          aws_access_key_id='MY ACCESS KEY',
                          aws_secret_access_key='MY SECRET KEY' 
                          )

    delivery_stream = 'my_firehose'
    #This line filters the Twitter stream to capture tweets matching the keyword 'brexit'
    #stream.filter(track=['trump'], stall_warnings=True)
    while True:
        try:
            print('Twitter streaming...')
            stream = Stream(auth, listener)
            stream.filter(track=['brexit'], languages=['en'], stall_warnings=True)
        except Exception as e:
            print(e)
            print('Disconnected...')
            time.sleep(5)
            continue   
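
A note on the record format: the code above joins the tweet fields with tabs, so even with the delivery settings fixed, the S3 objects will contain tab-separated text rather than JSON. If the objects are meant to contain JSON, one option, sketched below reusing the tweet, client and delivery_stream names from the code above, is to serialize each record with json.dumps and append a newline:

import json

#Sketch: send one newline-delimited JSON record per tweet instead of a
#tab-separated line. Reuses tweet, client and delivery_stream from above.
record = {
    'id': tweet['id'],
    'name': tweet['user']['name'],
    'screen_name': tweet['user']['screen_name'],
    'text': tweet.get('extended_tweet', {}).get('full_text', tweet.get('text', '')),
    'followers_count': tweet['user']['followers_count'],
    'location': tweet['user']['location'],
    'geo': tweet['geo'],
    'created_at': tweet['created_at'],
}
client.put_record(
    DeliveryStreamName=delivery_stream,
    Record={'Data': json.dumps(record) + '\n'}  #newline keeps records easy to split in S3
)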

2 Answers


  1. Chosen as BEST ANSWER

    So it looks like the files were coming in with JSON formatting after all; I just had to open the files from S3 in Firefox and I could see their contents. The issue with the file sizes is down to the Firehose buffer settings: I had them set to the lowest values, which is why the files were being delivered in such small chunks.
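
    For anyone hitting the same thing, the buffering hints can also be raised programmatically. Below is a minimal sketch with boto3, assuming the delivery stream from the question ('my_firehose') writes to S3 through an extended S3 destination; update_destination requires the current version id and destination id, which describe_delivery_stream returns.

    import boto3

    firehose = boto3.client('firehose', region_name='us-east-1')

    #Fetch the ids that update_destination requires
    desc = firehose.describe_delivery_stream(DeliveryStreamName='my_firehose')
    stream = desc['DeliveryStreamDescription']
    destination = stream['Destinations'][0]

    #Raise the buffering hints so Firehose batches more data per S3 object;
    #it flushes whenever either threshold is reached first.
    firehose.update_destination(
        DeliveryStreamName='my_firehose',
        CurrentDeliveryStreamVersionId=stream['VersionId'],
        DestinationId=destination['DestinationId'],
        ExtendedS3DestinationUpdate={
            'BufferingHints': {
                'SizeInMBs': 64,           #allowed range: 1-128 MB
                'IntervalInSeconds': 300   #allowed range: 60-900 seconds
            }
        }
    )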


  2. It's possible that you have enabled S3 compression for your Firehose. Please ensure that compression is disabled if you want to store raw JSON data in your bucket:

    (screenshot: the S3 compression setting in the Firehose console)

    You could also have a transformation applied to your Firehose which encodes or otherwise transforms your JSON messages into some other format.
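
    If you want to verify this from code rather than the console, here is a sketch with boto3 (assuming the stream name from the question) that inspects the current compression format and switches it off:

    import boto3

    firehose = boto3.client('firehose', region_name='us-east-1')

    desc = firehose.describe_delivery_stream(DeliveryStreamName='my_firehose')
    stream = desc['DeliveryStreamDescription']
    destination = stream['Destinations'][0]

    #The S3 destination description carries the active compression setting
    s3_config = destination['ExtendedS3DestinationDescription']
    print(s3_config['CompressionFormat'])  #e.g. 'GZIP' or 'UNCOMPRESSED'

    #Turn compression off so objects land in the bucket as raw JSON text
    firehose.update_destination(
        DeliveryStreamName='my_firehose',
        CurrentDeliveryStreamVersionId=stream['VersionId'],
        DestinationId=destination['DestinationId'],
        ExtendedS3DestinationUpdate={'CompressionFormat': 'UNCOMPRESSED'}
    )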
