
In my architecture on AWS, I have a service running on an EC2 instance which calls Twitter streaming API for data ingestion i.e. ingestion of real-time tweets. I call this service TwitterClient.

Twitter's API uses a kind of long polling over HTTP to deliver streaming data. The documentation says: a single connection is opened between your app (in my case, TwitterClient) and the API, with new tweets being sent through that connection.

TwitterClient then passes the real-time tweets to the backend (via Kinesis Data Streams) for processing.
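For context, the hand-off to Kinesis might look like the sketch below. The stream name and the use of the tweet's `id_str` as partition key are assumptions, not details from the question; only the record-shaping helper is shown as runnable code.

```python
import json

# Hypothetical helper: shape one tweet (a dict parsed from the streaming API)
# into a Kinesis Data Streams record. Using the tweet's "id_str" as the
# partition key keeps retries for the same tweet on the same shard.
def to_kinesis_record(tweet):
    return {
        "Data": json.dumps(tweet).encode("utf-8"),
        "PartitionKey": tweet["id_str"],
    }

# With boto3, TwitterClient could then forward each tweet like this
# (stream name "tweets" is an assumption):
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(StreamName="tweets", **to_kinesis_record(tweet))
```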

The problem I am facing is that running multiple EC2 instances in parallel will result in duplicate tweets being ingested, with each tweet processed several times. However, running only a single EC2 instance creates a single point of failure.

I cannot afford downtime as I can’t miss a single tweet.

What should I do to ensure high availability?

Edit: Added a brief description of how Twitter API delivers streaming data

2 Answers


  1. You can set up two EC2 instances behind a Load Balancer, keeping only one instance active at a time and the other as a passive backup. The second becomes active when the first goes down.
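One way to implement active/passive without a load balancer is a lease that only one instance can hold at a time. This is a minimal sketch of the lease logic; `store` is an in-memory stand-in for a shared table (e.g. DynamoDB), where the compare-and-set below would be a conditional write rather than a plain dict update. All names are illustrative.

```python
import time

LEASE_SECONDS = 30  # assumed lease length; tune to your failover tolerance

def try_acquire(store, instance_id, now=None):
    """Return True if `instance_id` holds (or just took over) the lease."""
    now = time.time() if now is None else now
    holder = store.get("holder")
    expires = store.get("expires", 0)
    # Acquire if the lease is free, already ours, or expired.
    if holder in (None, instance_id) or now >= expires:
        store["holder"] = instance_id
        store["expires"] = now + LEASE_SECONDS
        return True
    return False
```

Each instance would call `try_acquire` every few seconds; only the current holder reads the Twitter stream, and the passive instance takes over automatically once the lease expires.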

  2. The simplest way to implement this is to run multiple EC2 instances in parallel, in different regions. You can certainly get more complex, and use heartbeats between the instances, but this is probably over-engineering.

    multiple EC2 instances in parallel will result in duplicate tweets being ingested and each tweet will be processed several times

    Tweets have a unique message ID that can be used to deduplicate.
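A sketch of ID-based deduplication, assuming tweets carry a unique ID as the answer says. This keeps a bounded in-memory window of seen IDs; to deduplicate across instances you would back it with a shared store instead.

```python
from collections import OrderedDict

class TweetDeduplicator:
    """Remember the last `max_ids` tweet IDs and reject repeats."""

    def __init__(self, max_ids=100_000):
        self.max_ids = max_ids
        self.seen = OrderedDict()

    def is_new(self, tweet_id):
        if tweet_id in self.seen:
            return False
        self.seen[tweet_id] = True
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict the oldest remembered ID
        return True
```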

    I can’t miss a single tweet

    This is your real problem. Twitter limits you to a certain number of requests per 15-minute period. Assuming that you have reasonable filter rules (i.e., you don't try to read the entire tweetstream, or even the stream for a broad topic), this should be sufficient to capture all tweets.

    However, it may not be sufficient if you’re running multiple instances. You could try using two API keys (assuming that Twitter allows that) or adjust your polling frequency to something that allows multiple instances to run concurrently.
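One way to keep instances inside those windows is a fixed-window request budget matching the 15-minute period. The per-window limit below is a placeholder (check your endpoint's documented limit); if two instances share one API key, each would get a fraction of the budget.

```python
WINDOW_SECONDS = 15 * 60  # Twitter rate limits reset every 15 minutes

class RequestBudget:
    """Allow at most `limit_per_window` requests per fixed 15-minute window."""

    def __init__(self, limit_per_window):
        self.limit = limit_per_window
        self.window_start = None
        self.used = 0

    def allow(self, now):
        # Start a fresh window if none exists or the current one has elapsed.
        if self.window_start is None or now - self.window_start >= WINDOW_SECONDS:
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False
```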

    Beware, however: as far as I know there are no guarantees. If you need guaranteed access to every relevant tweet, you would need to talk to Twitter (and be prepared to pay them for the privilege).
