How to extract or grab all shortened URLs from a tweet? - Twitter API

utengr
August 23, 2017
116 views
0 votes
2 Answers

I want to extract the shortened URLs from tweets, if there are any. These URLs follow a standard form: http://t.co

For this, I used the following regex, which works fine when I test it on the tweet text stored as a plain string.

NOTE:
I am using https://shortnedurl/string instead of the real shortened URL because StackOverflow does not allow posting such URLs here.

Sample code:

import re

tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  tweet)
for url in urls:
    print("printing urls", url)

The output of this code:

printing urls https://shortnedurl.com

However, when I read the tweet from Twitter through its API and run the same regex on it, I get the following undesirable output:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead

It seems to be picking up the URLs for the Twitter handle and the hashtag as well.

How can I deal with this problem? I only want to extract the http://t.co URLs.

UPDATE1:
I tried https?://t\.co/\S*; however, I am still getting the following noisy URLs:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

I do not know why the same URL is matched a second time, with </a></span> attached.

For https?://t\.co/\S+, I get invalid URLs because it combines both of the above URLs into one:

printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

UPDATE2:
The tweet text looks a bit different from what I expected:

    Grim discovery in the USS McCain collision probe
    <span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span>
    <span class="username"><a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
    reports <span class="tag"><a href="https://twitter.com/search?q=%23TheLead">#TheLead</a></span>

Answers

- Jan
- August 23, 2017 at 11:23 am
- 0 votes
If I understand you correctly, just put the string you want to have contained in your regex, like so:
```
https?://shortnedurl\.com/\S*
# look for http:// or https://
# shortnedurl.com/ literally
# followed by anything not a whitespace character, 0+
```
See a demo on regex101.com.
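In Python, the pattern could be used roughly as follows. This is only a sketch: it reuses the placeholder domain from the question, and the `/string` path in the sample tweet is assumed for illustration, since the pattern requires a slash after the domain.

```python
import re

# Placeholder domain from the question; the "/string" path is an assumption.
tweet = ("Grim discovery in the USS McCain collision probe "
         "https://shortnedurl.com/string @MattRiversCNN reports #TheLead")

# The dot is escaped so it matches a literal "."; \S* consumes everything
# up to the next whitespace character.
pattern = re.compile(r"https?://shortnedurl\.com/\S*")

urls = pattern.findall(tweet)
print(urls)
```

Because `\S*` stops only at whitespace, this works on plain text but would also swallow any markup glued to the URL.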
For your special case:
```
https?://t\.co/\S*
```
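If the text returned by the API contains HTML markup, as in the question's second update, `\S*` will also consume `"`, `<` and `>`, and the same link shows up twice (once in the `href` attribute and once as the anchor text). One possible workaround, sketched below, is to stop the match at those characters and drop duplicates. The `t.co/AbC123` link, the `tweet_html` string and the `extract_tco_urls` helper are hypothetical, made up for this example.

```python
import re

# Hypothetical markup modelled on the question; the t.co link is a placeholder.
tweet_html = (
    'Grim discovery in the USS McCain collision probe '
    '<span class="link"><a href="https://t.co/AbC123">https://t.co/AbC123</a></span> '
    '<span class="username"><a href="https://twitter.com/MattRiversCNN">'
    '@MattRiversCNN</a></span> '
    'reports <span class="tag"><a href="https://twitter.com/search?q=%23TheLead">'
    '#TheLead</a></span>'
)

# Stop at whitespace, quotes and angle brackets so HTML syntax is not captured.
TCO_PATTERN = re.compile(r"""https?://t\.co/[^\s"'<>]+""")


def extract_tco_urls(text):
    """Return t.co URLs in order of first appearance, without duplicates."""
    unique = []
    for url in TCO_PATTERN.findall(text):
        if url not in unique:
            unique.append(url)
    return unique


print(extract_tco_urls(tweet_html))
```

The twitter.com profile and search links are ignored because the pattern only accepts the `t.co` host. For heavily nested or malformed markup, an HTML parser would likely be more robust than a regex.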