
I want to extract shortened URLs from tweets, if there are any. These URLs follow a standard form: http://t.co (details here)

For this, I used the following regex, which works fine when I test it on tweet text stored as a plain string.

NOTE:
I am using https://shortnedurl/string instead of the real shortened URL because StackOverflow does not allow posting such URLs here.

Sample code:

import re

tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"

# Find every http:// or https:// URL in the tweet text
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  tweet)
for url in urls:
    print("printing urls", url)

The output of this code:

printing urls https://shortnedurl.com

However, when I read the tweet from Twitter using its API and run the same regex on it, I get the following undesirable output:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead

It seems to be picking up the URLs for the Twitter username and the hashtag as well.

How can I deal with this problem? I only want to extract these http://t.co URLs.

UPDATE1:
I tried https?://t\.co/\S*; however, I am still getting the following noisy URLs:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

I do not know why the same URL is found again with the trailing </a></span>.

For https?://t\.co/\S+, I get invalid URLs because it combines both of the above URLs into one:

printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

UPDATE2:
The tweet text looks a bit different from what I expected:

    Grim discovery in the USS McCain collision probe 
<span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span> <span class="username"><a 
href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
     reports <span class="tag"><a href="https://twitter.com/search?
    q=%23TheLead">#TheLead</a></span>

2 Answers


  1. If I understand you correctly, just put the string you want to have contained in your regex, like so:

    https?://shortnedurl\.com/\S*
    # match http:// or https://
    # then shortnedurl.com/ literally
    # followed by zero or more non-whitespace characters
    

    See a demo on regex101.com.
    For your special case:

    https?://t\.co/\S*
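
    As a minimal sketch of how this pattern might be used with re.findall (with the non-whitespace class written as \S and the dot escaped), assuming the tweet is plain text. The t.co link and the variable names below are made-up placeholders, not taken from the question. The second call illustrates that \S* keeps going through quotes and angle brackets, so on HTML-formatted text it can swallow the markup that follows the link:

```python
import re

# Plain-text tweet; the t.co link is a made-up placeholder.
tweet = ("Grim discovery in the USS McCain collision probe "
         "https://t.co/AbC123 @MattRiversCNN reports #TheLead")

# \. matches a literal dot; \S* consumes everything up to the next whitespace.
pattern = r'https?://t\.co/\S*'

plain_urls = re.findall(pattern, tweet)
print(plain_urls)

# Hypothetical HTML fragment shaped like the markup in UPDATE2.
html_fragment = ('<a href="https://t.co/AbC123">https://t.co/AbC123</a>'
                 '</span> reports')

# Quotes and angle brackets are not whitespace, so the match runs on
# until the first space after the closing tags.
html_urls = re.findall(pattern, html_fragment)
print(html_urls)
```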
    
  2. You can use the regex:

    https?://t\.co/\S+
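
    If the tweet text arrives as HTML, as in UPDATE2, one possible variation (an assumption on my part, not something confirmed in the question) is to stop the match at quotes and angle brackets as well as whitespace, and then drop duplicates, since the same link appears both in the href attribute and in the link text. The sample HTML, the t.co link, and the helper name extract_tco_urls are hypothetical:

```python
import re

# Hypothetical HTML-formatted tweet, shaped like the text in UPDATE2.
html = ('Grim discovery in the USS McCain collision probe '
        '<span class="link"><a href="https://t.co/AbC123">'
        'https://t.co/AbC123</a></span> '
        '<span class="username"><a href="https://twitter.com/MattRiversCNN">'
        '@MattRiversCNN</a></span> reports')

# Stop at whitespace, quotes, or angle brackets so markup is not swallowed.
TCO_PATTERN = re.compile(r'https?://t\.co/[^\s"\'<>]+')


def extract_tco_urls(text):
    """Return t.co URLs in order of first appearance, without duplicates."""
    seen = []
    for url in TCO_PATTERN.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


tco_urls = extract_tco_urls(html)
for url in tco_urls:
    print(url)
```

    The twitter.com profile and search links are skipped because the pattern requires the host to be exactly t.co right after the scheme. For heavily nested or malformed markup, an HTML parser would likely be more robust than a regex.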
    