I’m working with Twitter IDs, which are strings because they are so large.

Twitter’s API has a “since_id” parameter, and I want to search for tweets since the earliest tweet in a list.

For example:

tweet_ids = [u'1003659997241401843', u'1003659997241401234234', u'100365999724140136236'] # etc
since_id = min(tweet_ids)

So far min(tweet_ids) works, but I want to understand why: did it work just by chance on the few samples I gave it, or is it guaranteed to always work?

Edit: To clarify, I need to get the lowest tweet ID. How do I get it if the IDs are strings representing numbers greater than 2^32-1, which therefore can’t be represented as integers in Python 2.7 on a 32-bit machine?

I am using Python 2.7, if that matters.

2 Answers


  1. Python will compare these strings exactly as it compares any other strings; that is, it will compare them lexicographically.

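    Thus, it will put '12' before '2', which is probably not what you want. min() on the strings gives the numerical minimum only when all the ids have the same number of digits.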

    Here’s a function that will compute the numerical minimum of strings representing integers for you.

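    # A is a non-empty sequence of strings representing non-negative
    # integers without leading zeros.
    def numerical_min(A):
        cur_min = A[0]
        for x in A[1:]:
            # Fewer digits always means a smaller number.
            if len(x) < len(cur_min):
                cur_min = x
                continue
            if len(x) > len(cur_min):
                continue
            # Same length: the first differing digit decides the order.
            for m, n in zip(x, cur_min):
                if int(m) < int(n):
                    cur_min = x
                    break
                if int(m) > int(n):
                    break
        return cur_min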
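
The same length-then-digits rule can also be sketched with a key function instead of an explicit loop. This is only an illustration, not part of the answer above: `numeric_min_by_key` is a hypothetical name, and the sketch assumes every id is a string of decimal digits with no sign and no leading zeros.

```python
def numeric_min_by_key(ids):
    """Return the numerically smallest string in ids without calling int().

    Assumption (not checked here): every element is a string of decimal
    digits with no sign and no leading zeros.
    """
    # A shorter string is a smaller number; strings of equal length
    # order the same way lexicographically as they do numerically.
    return min(ids, key=lambda s: (len(s), s))


tweet_ids = ['1003659997241401843',
             '1003659997241401234234',
             '100365999724140136236']
print(numeric_min_by_key(tweet_ids))  # 1003659997241401843
```

Sorting on the pair (length, string) avoids any integer conversion, which may be convenient if the ids must stay strings throughout.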
    
  2. The Python documentation implies that all strings, including yours, which are long sequences of digits, are compared lexicographically. This has a few consequences:

    • The “lesser integer” string "2" is greater than the “greater integer” string "100" when compared this way, because the character "2" sorts after the character "1".
    • Negative integers do not sort numerically either. "-1" is less than "99" because the hyphen sorts before all digits, but "-1" is also less than "-2", even though -1 is the greater number.
    • Equal integers "2" and "02" aren’t necessarily equal in terms of string comparison. "02" is less than "2" string-wise because of the leading zero.
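
The points above can be checked with a short sketch that compares a few pairs both as strings and as integers. The pairs are chosen here purely for illustration, and the names `pairs` and `results` are not from the answer.

```python
# Each pair (a, b) is compared twice: once as strings, once as integers.
pairs = [('12', '2'), ('100', '2'), ('-1', '99'), ('-1', '-2'), ('02', '2')]

results = []
for a, b in pairs:
    as_strings = a < b            # character-by-character comparison
    as_numbers = int(a) < int(b)  # comparison of the integer values
    results.append((a, b, as_strings, as_numbers))
    print('%5s < %-5s strings: %-5s numbers: %s' % (a, b, as_strings, as_numbers))
```

Every pair compares as "less than" when treated as strings, but only ('-1', '99') is also "less than" numerically.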

    It is better to convert each str to a long and then compare the numbers. Python 2.7’s long type has arbitrary precision, so this also works on a 32-bit machine. For example:

    tweet_ids = [long('1003659997241401843'), long('1003659997241401234234'), long('100365999724140136236')]
    since_id = min(tweet_ids)

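    If you need since_id back as a str, for example because JSON consumers may not handle integers this large, keep tweet_ids as the original list of strings rather than converting it. min() then compares the ids by their integer values but returns the original string. Replace the since_id line with

    since_id = min(tweet_ids, key=int)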
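
As a rough check, here is a sketch that applies both approaches to the sample ids from the question. The variable names `lexicographic_min` and `numeric_min` are illustrative only. It uses int rather than long so that it also runs on Python 3; on Python 2.7, int() returns a long automatically for values this large.

```python
tweet_ids = [u'1003659997241401843',
             u'1003659997241401234234',
             u'100365999724140136236']

# Plain min() compares the strings character by character.
lexicographic_min = min(tweet_ids)

# key=int compares the integer values but returns the original string.
numeric_min = min(tweet_ids, key=int)

print(lexicographic_min)  # 1003659997241401234234
print(numeric_min)        # 1003659997241401843
```

On this sample the two results differ: the plain string comparison picks the 22-digit id, which is numerically the largest of the three, so it is not safe to rely on min(tweet_ids) when the ids vary in length.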