
I am trying to use since_id to fetch tweets with the Twitter search API. In the code below I create a map from each query keyword to its since ID, defaulting every since ID to 0. My goal is to update the since ID each time I run a query, so that the next run does not return the same tweets and instead starts from the last tweet seen.

import java.io.{PrintWriter, StringWriter}
import java.util.Properties
import com.google.common.io.Resources
import twitter4j._
import scala.collection.JavaConversions._
// reference: http://bcomposes.com/2013/02/09/using-twitter4j-with-scala-to-access-streaming-tweets/
object Util {
    val props = Resources.getResource("twitter4j.props").openStream()
    val properties = new Properties()
    properties.load(props)

    val config = new twitter4j.conf.ConfigurationBuilder()
        .setDebugEnabled(properties.getProperty("debug").toBoolean)
        .setOAuthConsumerKey(properties.getProperty("consumerKey"))
        .setOAuthConsumerSecret(properties.getProperty("consumerSecret"))
        .setOAuthAccessToken(properties.getProperty("accessToken"))
        .setOAuthAccessTokenSecret(properties.getProperty("accessTokenSecret"))
    val tempKeys =List("Yahoo","Bloomberg","Messi", "JPM Chase","Facebook")
    val sinceIDmap : scala.collection.mutable.Map[String, Long] = collection.mutable.Map(tempKeys map { ix => s"$ix" -> 0.toLong } : _*)
    //val tweetsMap: scala.collection.mutable.Map[String, String]
    val configBuild = (config.build())
    val MAX_TWEET=100
    getTweets()

    def getTweets(): Unit ={
        sinceIDmap.keys.foreach((TickerId) => getTweets(TickerId))
    }

    def getTweets(TickerId: String): scala.collection.mutable.Map[String, scala.collection.mutable.Buffer[String]] = {
        println("Search key is:"+TickerId)
        var tweets = scala.collection.mutable.Map[String, scala.collection.mutable.Buffer[String]]()
        try {
            val twitter: Twitter = new TwitterFactory(configBuild).getInstance
            val query = new Query(TickerId)
            query.setSinceId(sinceIDmap.get(TickerId).get)
            query.setLang("en")
            query.setCount(MAX_TWEET)
            val result = twitter.search(query)
            tweets += ( TickerId -> result.getTweets().map(_.getText))

            //sinceIDmap(TickerId)=result.getSinceId
            println("-----------Since id is :"+result.getSinceId )
            //println(tweets)
        }
        catch {
            case te: TwitterException =>
                println("Failed to search tweets: " + te.getMessage)
        }
        tweets
    }
}

object StatusStreamer {
    def main(args: Array[String]) {
        Util
    }
}

output:

Search key is:Yahoo    
log4j:WARN No appenders could be found for logger (twitter4j.HttpClientImpl).
log4j:WARN Please initialize the log4j system properly.
-----------Since id is :0
Search key is:JPM Chase
-----------Since id is :0
Search key is:Facebook
-----------Since id is :0
Search key is:Bloomberg
-----------Since id is :0
Search key is:Messi
-----------Since id is :0

The problem is that when I print the since ID after running the query, it has the same value I set initially. Can someone point out what I am doing wrong here? Or, if my approach is wrong, can someone share another approach that would work?

Thanks

4 Answers


  1. From a cursory look at your code, it appears that you never update the values in sinceIDmap. You have commented out the following:

    //sinceIDmap(TickerId)=result.getSinceId
    

    So, per keyword, the since_id is never updated from 0.

    If you’re having problems beyond that, it might be worth examining the Twitter4J SearchTweets example on GitHub.

  2. The Twitter API echoes back the since_id value that was sent in the query, so QueryResult.getSinceId returns the same value you set on the Query.

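    The simplest solution is to set the next since ID to the maximum tweet ID in the response, skipping the update when the response is empty (calling max on an empty collection throws):

    if (!result.getTweets().isEmpty) sinceIDmap(TickerId) = result.getTweets().map(_.getId).max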
    

    In general, to page through results more smoothly you can combine the since_id and max_id query parameters. The official Twitter guide explains well how to use them.
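
    As a rough illustration of the bookkeeping described above, here is a minimal, library-free sketch in Java. The class `SinceIdTracker` and its methods are hypothetical names invented for this example; the IDs passed to `update` stand in for the IDs of the tweets returned by one search, and no Twitter4J calls are made.

    ```java
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical helper: remembers the highest tweet ID seen per keyword.
    class SinceIdTracker {
        private final Map<String, Long> sinceIds = new HashMap<>();

        /** Value to send as since_id on the next request (0 = nothing seen yet). */
        long sinceIdFor(String keyword) {
            return sinceIds.getOrDefault(keyword, 0L);
        }

        /** Records one page of returned IDs; an empty page leaves the value unchanged. */
        void update(String keyword, List<Long> returnedIds) {
            long max = sinceIdFor(keyword);
            for (long id : returnedIds) {
                max = Math.max(max, id);
            }
            sinceIds.put(keyword, max);
        }
    }
    ```

    Starting from the stored value rather than from the page alone means an empty response cannot reset or corrupt the since ID.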

  3. First of all, your approach of using since_id alone will not work. I made the same mistake in the past and could not get it to work. It is also not in line with the official Working with Timelines guide, which worked for me and which I suggest you follow. Long story short, you cannot use since_id alone to move through a timeline of tweets (in your case, the timeline that GET search/tweets returns). You definitely need max_id to do what you describe, and I believe since_id has a secondary, optional role (which could be implemented in your code as well).

    The API docs made me believe I could use since_id exactly like max_id, but I was wrong. Specifying only since_id, I noticed that the returned tweets were extremely fresh, as if since_id had been completely disregarded. Here is another question that demonstrates this unexpected behavior. As I see it, since_id is used only for pruning, not for moving through the timeline: using since_id alone gets the freshest tweets, restricted to those with an ID greater than since_id. Not what you would want. A final piece of evidence, taken from the official guidelines, is this graphical representation of a specific request:

    [Image: since_id]

    Not only does since_id not move you through the timeline, but it happens to be completely useless in this specific request. It will not be useless in the next request, though, since it will prune Tweet 10 (and anything before that). But the fact is that since_id does not move you through the timeline.

    In general you need to think in terms of going from the latest tweets to the oldest tweets, not the other way around. To do that, specify max_id in your request as the inclusive upper bound on the IDs of the tweets to be returned, and update this parameter between successive requests.

    When max_id is present in a request, it sets an inclusive upper bound on the IDs of the returned tweets. From the returned tweets, take the smallest ID and use it as max_id in the next request; because max_id is inclusive, decrement that smallest ID by one first so that you do not get the oldest tweet of the previous request again. The very first request should not specify max_id, so that the freshest tweets are returned. With this approach, each request after the first takes you one step deeper into the past.

    since_id comes in handy when you need to restrict your trip to the past. Imagine that at some point in time, t0, you start searching tweets, and that the biggest tweet ID from your first search is id0. After that first search, the tweet IDs in subsequent searches keep getting smaller, because you are going back. After some time, you will have got about a week's tweets and a search will return nothing. At that point in time, t1, you know that this trip to the past is over. But between t0 and t1 more tweets will have been tweeted, so another trip to the past should start at t1 and continue until you reach the tweet with ID id0 (which was tweeted before t0). This trip can be restricted by using id0 as since_id in the trip's requests, and so on. Alternatively, you could avoid since_id by ending the trip once you get a tweet whose ID is less than or equal to id0 (keep in mind that tweets can be deleted). But I suggest you use since_id for ease and efficiency. Just remember that since_id is exclusive while max_id is inclusive.

    For more information, see the official Working with Timelines. You will notice that the section “The max_id parameter” comes first and the section “Using since_id for the greatest efficiency” comes later. The latter section’s title indicates that since_id is not for moving through a timeline.

    A coarse untested example, using Twitter4J in Java, that prints tweets starting from the latest and going to the past is the following:

    // Make sure this is initialized correctly.
    Twitter twitter;
    
    /**
     * Searches and prints tweets starting from now and going back to the past.
     * 
     * @param q
     *            the search query, e.g. "#yolo"
     */
    private void searchAndPrintTweets(String q) throws TwitterException {
        // `max_id` needed by `GET search/tweets`. If it is 0 (first iteration),
        // it will not be used for the query.
        long maxId = 0;
        // Let us assume that it will run forever.
        while (true) {
            Query query = new Query(q);
            query.setCount(100);
            query.setLang("en");
            // Set `max_id` as an inclusive upper limit, unless this is the
            // first iteration. If this is the first iteration (maxId == 0), the
            // freshest/latest tweets will come.
            if (maxId != 0)
                query.setMaxId(maxId);
            QueryResult qr = twitter.search(query);
            printTweets(qr.getTweets());
            // For next iteration. Decrement smallest ID by 1, so that we will
            // not get the oldest tweet of this iteration in the next iteration
            // as well, since `max_id` is inclusive.
            maxId = calculateSmallestId(qr.getTweets()) - 1;
        }
    }
    
    /**
     * Calculates the smallest ID among a list of tweets.
     * 
     * @param tweets
     *            the list of tweets
     * @return the smallest ID
     */
    private long calculateSmallestId(List<Status> tweets) {
        long smallestId = Long.MAX_VALUE;
        for (Status tweet : tweets) {
            if (tweet.getId() < smallestId)
                smallestId = tweet.getId();
        }
        return smallestId;
    }
    
    /**
     * Prints the content of the tweets.
     * 
     * @param tweets
     *            the tweets
     */
    private void printTweets(List<Status> tweets) {
        for (Status tweet : tweets) {
            System.out.println(tweet.getText());
        }
    }
    

    There is no error handling, no special condition checking (e.g. empty list of tweets in query result), and no use of since_id, but it should get you started.
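
    The two pieces the example above leaves out, the since_id lower bound and the empty-page stop condition, can be sketched without any library as follows. This is only an illustration under stated assumptions: `BoundedWalk`, `collect`, and `inMemorySearch` are hypothetical names, and `inMemorySearch` is an in-memory stand-in for a search request that treats since_id as exclusive and max_id as inclusive, as described earlier. It works on bare IDs rather than Twitter4J `Status` objects.

    ```java
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.BiFunction;

    class BoundedWalk {
        /**
         * Walks from the newest ID back to (but excluding) sinceId.
         * searchPage is a stand-in for one request: given (sinceId, maxId) it
         * returns one page of IDs; maxId <= 0 means "no upper bound".
         */
        static List<Long> collect(long sinceId, BiFunction<Long, Long, List<Long>> searchPage) {
            List<Long> collected = new ArrayList<>();
            long maxId = 0; // not applied on the first request
            while (true) {
                List<Long> page = searchPage.apply(sinceId, maxId);
                if (page.isEmpty()) {
                    break; // nothing newer than sinceId is left
                }
                collected.addAll(page);
                long smallest = Long.MAX_VALUE;
                for (long id : page) {
                    smallest = Math.min(smallest, id);
                }
                maxId = smallest - 1; // max_id is inclusive, so step below it
                if (maxId <= sinceId) {
                    break; // the next page could only be empty
                }
            }
            return collected;
        }

        /** In-memory stand-in for the search endpoint, newest IDs first. */
        static BiFunction<Long, Long, List<Long>> inMemorySearch(List<Long> ids, int pageSize) {
            return (sinceId, maxId) -> {
                List<Long> sorted = new ArrayList<>(ids);
                sorted.sort(Comparator.reverseOrder());
                List<Long> page = new ArrayList<>();
                for (long id : sorted) {
                    boolean aboveSince = id > sinceId;            // exclusive bound
                    boolean belowMax = maxId <= 0 || id <= maxId; // inclusive bound
                    if (aboveSince && belowMax && page.size() < pageSize) {
                        page.add(id);
                    }
                }
                return page;
            };
        }
    }
    ```

    On the next trip, the largest ID collected here would become the new since ID, which is the id0 bookkeeping described above.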

  4. since_id and max_id are both very simple parameters you can use to limit what you get back from the API. From the docs:

    since_id – Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.

    max_id – Returns results with an ID less than (that is, older than) or equal to the specified ID.

    So, if you have a given tweet ID, you can search for older or newer tweets by using these two parameters.

    count is even simpler: it specifies the maximum number of tweets you want to get back, up to 200 for user_timeline (GET search/tweets, which the question uses, caps it at 100).

    Unfortunately the API will not give you back exactly what you want — you cannot specify a date/time when querying user_timeline — although you can specify one when using the search API. Anyway, if you need to use user_timeline, then you will need to poll the API, gathering up tweets, figuring out if they match the parameters you desire, and then calculating your stats accordingly.
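
    The client-side filtering step mentioned above could look roughly like this. It is a sketch only: `TimeWindowFilter`, its nested `Tweet` class, and `within` are hypothetical names made up for the example, and `Tweet` carries just an ID and a creation time rather than being a real API type.

    ```java
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    class TimeWindowFilter {
        /** Minimal stand-in for a polled tweet: only the fields this sketch needs. */
        static final class Tweet {
            final long id;
            final Instant createdAt;

            Tweet(long id, Instant createdAt) {
                this.id = id;
                this.createdAt = createdAt;
            }
        }

        /** Keeps the tweets whose creation time satisfies from <= createdAt < to. */
        static List<Tweet> within(List<Tweet> polled, Instant from, Instant to) {
            List<Tweet> kept = new ArrayList<>();
            for (Tweet t : polled) {
                if (!t.createdAt.isBefore(from) && t.createdAt.isBefore(to)) {
                    kept.add(t);
                }
            }
            return kept;
        }
    }
    ```

    The half-open window (start inclusive, end exclusive) is a design choice for the sketch, so that consecutive windows do not count the same tweet twice.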
