I am trying to use since_id to fetch tweets with the Twitter search API. Below is my code; it creates a map from each query keyword to a since ID. I default the since ID to 0, and my goal is to update it every time I run the query, so that the next run does not return the same tweets and instead starts from the last tweet seen.
import java.io.{PrintWriter, StringWriter}
import java.util.Properties
import com.google.common.io.Resources
import twitter4j._
import scala.collection.JavaConversions._

// reference: http://bcomposes.com/2013/02/09/using-twitter4j-with-scala-to-access-streaming-tweets/
object Util {
  val props = Resources.getResource("twitter4j.props").openStream()
  val properties = new Properties()
  properties.load(props)
  val config = new twitter4j.conf.ConfigurationBuilder()
    .setDebugEnabled(properties.getProperty("debug").toBoolean)
    .setOAuthConsumerKey(properties.getProperty("consumerKey"))
    .setOAuthConsumerSecret(properties.getProperty("consumerSecret"))
    .setOAuthAccessToken(properties.getProperty("accessToken"))
    .setOAuthAccessTokenSecret(properties.getProperty("accessTokenSecret"))
  val tempKeys = List("Yahoo", "Bloomberg", "Messi", "JPM Chase", "Facebook")
  val sinceIDmap: scala.collection.mutable.Map[String, Long] = collection.mutable.Map(tempKeys map { ix => s"$ix" -> 0.toLong }: _*)
  //val tweetsMap: scala.collection.mutable.Map[String, String]
  val configBuild = (config.build())
  val MAX_TWEET = 100
  getTweets()

  def getTweets(): Unit = {
    sinceIDmap.keys.foreach((TickerId) => getTweets(TickerId))
  }

  def getTweets(TickerId: String): scala.collection.mutable.Map[String, scala.collection.mutable.Buffer[String]] = {
    println("Search key is:"+TickerId)
    var tweets = scala.collection.mutable.Map[String, scala.collection.mutable.Buffer[String]]()
    try {
      val twitter: Twitter = new TwitterFactory(configBuild).getInstance
      val query = new Query(TickerId)
      query.setSinceId(sinceIDmap.get(TickerId).get)
      query.setLang("en")
      query.setCount(MAX_TWEET)
      val result = twitter.search(query)
      tweets += ( TickerId -> result.getTweets().map(_.getText))
      //sinceIDmap(TickerId)=result.getSinceId
      println("-----------Since id is :"+result.getSinceId )
      //println(tweets)
    }
    catch {
      case te: TwitterException =>
        println("Failed to search tweets: " + te.getMessage)
    }
    tweets
  }
}

object StatusStreamer {
  def main(args: Array[String]) {
    Util
  }
}
output:
Search key is:Yahoo
log4j:WARN No appenders could be found for logger (twitter4j.HttpClientImpl).
log4j:WARN Please initialize the log4j system properly.
-----------Since id is :0
Search key is:JPM Chase
-----------Since id is :0
Search key is:Facebook
-----------Since id is :0
Search key is:Bloomberg
-----------Since id is :0
Search key is:Messi
-----------Since id is :0
The problem is that when I print the since ID after running the query, it has the same value I set initially. Can someone point out what I am doing wrong here? Or, if my approach is wrong, can someone share another approach that would work?
Thanks
4 Answers
From a cursory look at your code, it appears that you never update the values in sinceIDmap. You have commented out the following:

    //sinceIDmap(TickerId)=result.getSinceId

So, per keyword, the since_id is never updated from 0.

If you’re having problems beyond that, it might be worth examining the Twitter4J SearchTweets example on GitHub.

The Twitter API returns the since_id value that was originally requested by the query. This means that QueryResult.getSinceId is the same value you put into Query.

The simplest solution is to set the next sinceId to the maximum tweet ID in the response.

In general, to make the results smoother, you can use a combination of the since_id and max_id query parameters. The official Twitter guide has a very good explanation of how to use them.

First of all, from the initial description of your approach, I can tell you that your approach using
since_id is incorrect. I have made the same mistake in the past and could not get it to work. Furthermore, your approach is not in line with the official Working with Timelines guide. The official guidelines worked for me then, and I suggest you follow them. Long story short, you cannot use since_id alone to move through a timeline of tweets (the timeline that GET search/tweets returns, in your case). You definitely need max_id to do what you describe. And, actually, I believe that since_id has a totally secondary/optional function (which could be implemented in your code as well). The API docs made me believe I could use since_id exactly like I could use max_id, but I was wrong. Specifying only since_id, I noticed that the returned tweets were extremely fresh, as if since_id had been completely disregarded. Here is another question that demonstrates this unexpected behavior. As I see it, since_id is used only for pruning, not for moving through the timeline. Using since_id alone will get the freshest/latest tweets, but restrict the tweets that are returned to the ones that have an ID greater than since_id. Not what you would want. A final piece of evidence, taken from the official guidelines, is this graphical representation of a specific request:

Not only does since_id not move you through the timeline, but it happens to be completely useless in this specific request. It will not be useless in the next request, though, since it will prune Tweet 10 (and anything before that). But the fact is that since_id does not move you through the timeline.

In general, you need to think in terms of going from the latest tweets to the oldest tweets, not the other way around. And to go from the latest tweets to the oldest, in your request you need to specify
max_id as the inclusive upper bound on the IDs of the tweets to be returned, and to update this parameter between successive requests.

The existence of max_id in a request will set the inclusive upper bound on the IDs of the tweets to be returned. From the returned tweets you can get the smallest ID that appears and use it as the value for max_id in the subsequent request (you can decrement the smallest ID by one and use this value for the next request’s max_id, since max_id is inclusive, so that you will not get the oldest tweet from the previous request again). The first ever request should have max_id not specified, so that the freshest/latest tweets will be returned. Using this approach, each request after the first request will take you one step deeper into the past.

since_id
can come in handy when you need to restrict your trip to the past. Imagine that at some point in time, t0, you start searching tweets. Let us assume that from your first search your biggest tweet ID is id0. After that first search, all tweet IDs in subsequent searches will be getting smaller, because you are going back. After some time, you will have got about a week’s tweets and a search of yours will return nothing. At that point in time, t1, you know that this trip to the past is over. But between t0 and t1 more tweets will have been tweeted. So, another trip to the past should start at t1 and continue until you have reached the tweet with ID id0 (which was tweeted before t0). This trip can be restricted by using id0 for since_id in the trip’s requests, and so on. Alternatively, the use of since_id could be avoided if you made sure that your trip would end once you got a tweet with an ID less than or equal to id0 (bear in mind that tweets can be deleted). But I suggest you try to use since_id for ease and efficiency. Just remember that since_id is exclusive while max_id is inclusive.

For more information, see the official Working with Timelines guide. You will notice that the section “The max_id parameter” comes first and the section “Using since_id for the greatest efficiency” comes later. The latter section’s title indicates that since_id is not for moving through a timeline.

A coarse untested example, using Twitter4J in Java, that prints tweets starting from the latest and going to the past is the following:
There is no error handling, no special condition checking (e.g. empty list of tweets in query result), and no use of since_id, but it should get you started.

since_id and max_id are both very simple parameters you can use to limit what you get back from the API. From the docs:
since_id – Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.
max_id – Returns results with an ID less than (that is, older than) or equal to the specified ID.
So, if you have a given tweet ID, you can search for older or newer tweets by using these two parameters.
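To make the two bounds concrete, here is a minimal sketch in Scala that does not call Twitter at all. The search function is a hypothetical in-memory stand-in for the search endpoint (newest first, since_id exclusive, max_id inclusive), and collectNewerThan and the sample IDs are likewise made up for illustration. With Twitter4J, the corresponding calls would presumably be Query.setSinceId, Query.setMaxId and Status.getId.

```scala
import scala.annotation.tailrec

object SearchWindowSketch {
  // Hypothetical stand-in for the search endpoint: returns at most `count` IDs,
  // newest first, with since_id exclusive and max_id inclusive.
  def search(timeline: Seq[Long], sinceId: Long, maxId: Option[Long], count: Int): Seq[Long] =
    timeline
      .filter(id => id > sinceId && maxId.forall(id <= _))
      .sorted(Ordering[Long].reverse)
      .take(count)

  // Walk from the newest tweet towards the past, one page at a time,
  // and stop once nothing newer than sinceId is left.
  def collectNewerThan(timeline: Seq[Long], sinceId: Long, count: Int): Vector[Long] = {
    @tailrec
    def loop(maxId: Option[Long], acc: Vector[Long]): Vector[Long] = {
      val page = search(timeline, sinceId, maxId, count)
      if (page.isEmpty) acc
      // max_id is inclusive, so subtract one to avoid re-fetching the oldest ID of this page.
      else loop(Some(page.min - 1), acc ++ page)
    }
    loop(None, Vector.empty)
  }

  def main(args: Array[String]): Unit = {
    val timeline: Seq[Long] = 1L to 10L
    val firstRun = collectNewerThan(timeline, sinceId = 0L, count = 3)
    println(firstRun) // Vector(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)

    // Remember the largest ID seen and use it as since_id on the next run.
    val nextSinceId = firstRun.max
    val later: Seq[Long] = timeline ++ Seq(11L, 12L)
    println(collectNewerThan(later, nextSinceId, count = 3)) // Vector(12, 11)
  }
}
```

Each run walks from the newest ID down to, but not including, since_id. The largest ID collected then becomes since_id for the next run, so the second run returns only the tweets that arrived in between.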
count is even simpler — it specifies a maximum number of tweets you want to get back, up to 200.
Unfortunately the API will not give you back exactly what you want — you cannot specify a date/time when querying user_timeline — although you can specify one when using the search API. Anyway, if you need to use user_timeline, then you will need to poll the API, gathering up tweets, figuring out if they match the parameters you desire, and then calculating your stats accordingly.
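The client-side step described above (gather tweets, then keep only those that match your criteria) might be sketched as follows. This is only an illustration under assumptions: Tweet, inWindow and the sample data are hypothetical and not part of Twitter4J or the Twitter API, and the polling itself is left out.

```scala
import java.time.Instant

object TimelineFilterSketch {
  // Hypothetical minimal record; a real client would map API results into something like this.
  final case class Tweet(id: Long, createdAt: Instant, text: String)

  // Keep only tweets created in the half-open window [from, until).
  def inWindow(tweets: Seq[Tweet], from: Instant, until: Instant): Seq[Tweet] =
    tweets.filter(t => !t.createdAt.isBefore(from) && t.createdAt.isBefore(until))

  def main(args: Array[String]): Unit = {
    // Made-up sample data standing in for tweets gathered by polling.
    val gathered = Seq(
      Tweet(1L, Instant.parse("2015-03-01T10:00:00Z"), "before the window"),
      Tweet(2L, Instant.parse("2015-03-02T09:30:00Z"), "inside the window"),
      Tweet(3L, Instant.parse("2015-03-03T00:00:00Z"), "on the upper bound, excluded")
    )
    val from  = Instant.parse("2015-03-02T00:00:00Z")
    val until = Instant.parse("2015-03-03T00:00:00Z")

    val matching = inWindow(gathered, from, until)
    println(s"${matching.size} of ${gathered.size} gathered tweets fall inside the window")
  }
}
```

Any statistics would then be computed over the filtered list rather than over everything the API returned.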