
I’m reading the book Web Scraping with Python, which includes the following function for retrieving the external links found on a page:

import re

#Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" or "www" and do
    #not contain the current URL
    for link in bs.find_all('a', {'href': re.compile('^(http|www)((?!'+excludeUrl+').)*$')}):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks
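As an aside, it is worth seeing exactly how this regex behaves. The negative lookahead only starts checking after the leading `http`/`www` has been consumed, and `excludeUrl` is inserted into the pattern as-is. A small sketch of why same-site links still match (these are the same URL shapes that appear in the output below):

```python
import re

excludeUrl = 'https://www.oreilly.com'
pattern = re.compile('^(http|www)((?!' + excludeUrl + ').)*$')

# No 'www' in the host, so excludeUrl never appears in the link at all:
print(bool(pattern.match('https://oreilly.com/sign-in.html')))

# Even the excluded URL itself matches: the lookahead only begins
# checking after the literal 'http' prefix is consumed, and the
# remainder 's://www.oreilly.com' never contains the full excludeUrl.
print(bool(pattern.match('https://www.oreilly.com')))
```

Both calls print `True`, so links on the very site being scraped pass the "external" filter.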

The problem is that it does not work the way it should. When I run it on the URL http://www.oreilly.com, it returns this:

bs = makeSoup('https://www.oreilly.com') # Makes a BeautifulSoup Object
getExternalLinks(bs, 'https://www.oreilly.com') 

Output:

['https://www.oreilly.com',
 'https://oreilly.com/sign-in.html',
 'https://oreilly.com/online-learning/try-now.html',
 'https://oreilly.com/online-learning/index.html',
 'https://oreilly.com/online-learning/individuals.html',
 'https://oreilly.com/online-learning/teams.html',
 'https://oreilly.com/online-learning/enterprise.html',
 'https://oreilly.com/online-learning/government.html',
 'https://oreilly.com/online-learning/academic.html',
 'https://oreilly.com/online-learning/pricing.html',
 'https://www.oreilly.com/partner/reseller-program.html',
 'https://oreilly.com/conferences/',
 'https://oreilly.com/ideas/',
 'https://oreilly.com/about/approach.html',
 'https://www.oreilly.com/conferences/',
 'https://conferences.oreilly.com/velocity/vl-ny',
 'https://conferences.oreilly.com/artificial-intelligence/ai-eu',
 'https://www.safaribooksonline.com/public/free-trial/',
 'https://www.safaribooksonline.com/team-setup/',
 'https://www.oreilly.com/online-learning/enterprise.html',
 'https://www.oreilly.com/about/approach.html',
 'https://conferences.oreilly.com/software-architecture/sa-eu',
 'https://conferences.oreilly.com/velocity/vl-eu',
 'https://conferences.oreilly.com/software-architecture/sa-ny',
 'https://conferences.oreilly.com/strata/strata-ca',
 'http://shop.oreilly.com/category/customer-service.do',
 'https://twitter.com/oreillymedia',
 'https://www.facebook.com/OReilly/',
 'https://www.linkedin.com/company/oreilly-media',
 'https://www.youtube.com/user/OreillyMedia',
 'https://www.oreilly.com/emails/newsletters/',
 'https://itunes.apple.com/us/app/safari-to-go/id881697395',
 'https://play.google.com/store/apps/details?id=com.safariflow.queue'] 

Question:

Why are the first 16-17 entries considered “external links”? They belong to the same domain as http://www.oreilly.com.

2 Answers


  1. There is a difference between these two:

    http://www.oreilly.com
    https://www.oreilly.com
    

    Because the regex only excludes links that contain the exact excludeUrl string, any link whose scheme or subdomain differs (e.g. https://oreilly.com without the www) slips past the filter. Hope you get my point.

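    One way to make that comparison robust (a sketch, not from the book; same_site is a hypothetical helper, and it requires Python 3.9+ for removeprefix) is to compare parsed hostnames instead of raw URL strings:

    ```python
    from urllib.parse import urlsplit

    def same_site(url, base):
        # Compare hostnames only, ignoring the scheme and a leading 'www.'
        host = urlsplit(url).netloc.lower().removeprefix('www.')
        base_host = urlsplit(base).netloc.lower().removeprefix('www.')
        return host == base_host

    print(same_site('https://oreilly.com/sign-in.html', 'http://www.oreilly.com'))  # True
    print(same_site('https://twitter.com/oreillymedia', 'http://www.oreilly.com'))  # False
    ```

    With a check like this, http vs. https and www vs. no-www no longer matter.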
  2. import re
    from urllib.request import urlopen
    from urllib.parse import urlsplit
    from bs4 import BeautifulSoup

    ext = set()

    def getExt(url):
        o = urlsplit(url)
        html = urlopen(url)
        bs = BeautifulSoup(html, 'html.parser')
        for link in bs.find_all('a', href=re.compile('^((https://)|(http://))')):
            if 'href' in link.attrs:
                if o.netloc in link.attrs['href']:
                    continue
                else:
                    ext.add(link.attrs['href'])

    getExt('https://oreilly.com/')
    for i in ext:
        print(i)
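    One caveat with the `o.netloc in link.attrs['href']` test above: it is a plain substring check, so an external URL that merely mentions the host in its path or query would be skipped as internal. Comparing the parsed netloc avoids that (the example URL below is made up for illustration):

    ```python
    from urllib.parse import urlsplit

    netloc = 'oreilly.com'
    href = 'https://example.com/redirect?to=oreilly.com'  # hypothetical external link

    print(netloc in href)                   # True  -> substring check calls it internal
    print(urlsplit(href).netloc == netloc)  # False -> parsed host is actually example.com
    ```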