To remove base url - Artificial Intelligence

SNandhini
August 14, 2018

I wrote a Python script to extract the href value from all links on a given web page:

```
from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://kteq.in/services")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
```

When I run the above code, I get the following output, which includes both external and internal links:

index
index
#
solutions#internet-of-things
solutions#online-billing-and-payment-solutions
solutions#customer-relationship-management
solutions#enterprise-mobility
solutions#enterprise-content-management
solutions#artificial-intelligence
solutions#b2b-and-b2c-web-portals
solutions#robotics
solutions#augement-reality-virtual-reality
solutions#azure
solutions#omnichannel-commerce
solutions#document-management
solutions#enterprise-extranets-and-intranets
solutions#business-intelligence
solutions#enterprise-resource-planning
services
clients
contact
#
#
#
https://www.facebook.com/KTeqSolutions/
#
#
#
#
#contactform
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
index
services
#
contact
#
iOSDevelopmentServices
AndroidAppDevelopment
WindowsAppDevelopment
HybridSoftwareSolutions
CloudServices
HTML5Development
iPadAppDevelopment
services
services
services
services
services
services
contact
contact
contact
contact
contact
None
https://www.facebook.com/KTeqSolutions/
#
#
#
#

I want to remove external links that have a full URL like https://www.facebook.com/KTeqSolutions/ while keeping links like solutions#internet-of-things. How can I do this efficiently?

Answers

- Joe
- August 14, 2018 at 10:33 am
- 0 votes
If I understood you correctly, you can try something like:
```
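l = []
for link in soup.findAll('a'):
    href = link.get('href')
    print href
    # Skip <a> tags without an href; link.get() returns None for them
    if href is not None:
        l.append(href)
# Drop external links; checking for "https" instead of "www" also works
l = [x for x in l if "www" not in x]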
```
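A substring test such as "www" can also match internal links that happen to contain those letters, and miss external links that lack them. Below is a minimal sketch of a stricter prefix check. It assumes Python 3 (the question uses Python 2), the helper name `is_internal` is hypothetical, and the sample list is copied by hand from the output in the question rather than scraped.

```python
def is_internal(href):
    """Return True for hrefs that are neither absolute nor protocol-relative URLs."""
    if href is None:  # <a> tags without an href attribute yield None
        return False
    # str.startswith accepts a tuple of prefixes
    return not href.startswith(("http://", "https://", "//"))


# Sample values copied from the question's output (an assumption, not a live scrape)
hrefs = [
    "index",
    "#",
    "solutions#internet-of-things",
    "https://www.facebook.com/KTeqSolutions/",
    None,
    "contact",
]

internal = [h for h in hrefs if is_internal(h)]
print(internal)
# ['index', '#', 'solutions#internet-of-things', 'contact']
```

With a real page, the same check could be applied to each `link.get('href')` value inside the loop.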

- nauer
- August 14, 2018 at 10:35 am
- 0 votes
You can use parse_url from urllib3, which the requests module exposes as requests.urllib3, to split a link into its components.
```
import requests

url = 'https://www.facebook.com/KTeqSolutions/'

requests.urllib3.util.parse_url(url)
```
gives you
```
Url(scheme='https', auth=None, host='www.facebook.com', port=None, path='/KTeqSolutions/', query=None, fragment=None)
```
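To turn parsing into a filter, one option is to check whether a link names a host. The sketch below swaps in `urlparse` from Python 3's standard library in place of `parse_url`, so it is an alternative to the call above, not the same API. The helper name `is_external` and the sample list are illustrative assumptions.

```python
from urllib.parse import urlparse


def is_external(href):
    """Return True when the href contains a network location (a host)."""
    return bool(urlparse(href).netloc)


# Sample values copied from the question's output (an assumption, not a live scrape)
hrefs = [
    "index",
    "solutions#internet-of-things",
    "#contactform",
    "https://www.facebook.com/KTeqSolutions/",
]

for href in hrefs:
    label = "external" if is_external(href) else "internal"
    print(href, "->", label)

# Keep only the links without a host
internal = [h for h in hrefs if not is_external(h)]
```

Links such as `mailto:` addresses have a scheme but no host, so this check treats them as internal; testing `urlparse(href).scheme` as well would exclude them.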