  • I’m working on a link-following exercise in Python (an assignment from the Coursera course on using Python to access web data). Here is the problem:
  • In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.
  • Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Armen.html
    Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
    Hint: The first character of the name of the last page that you will load is: F

So I basically copied most of the code from the given link and added my own part at the end. It looks like this:

import urllib.request, urllib.parse, urllib.error
import collections
collections.Callable = collections.abc.Callable
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
for i in range(7):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        tag = tags[17]
        link = tag.get('href', None)
        print(link)
    url = link

I got the right result, but it printed out like this:

http://py4e-data.dr-chuck.net/known_by_Hailie.html
… (the same line repeats what feels like a hundred times)
http://py4e-data.dr-chuck.net/known_by_Hailie.html

http://py4e-data.dr-chuck.net/known_by_Felicity.html

Which part of my code leads to this problem?
How can I stop the repetition?

2 Answers


  1. If I understand correctly, the issue is that the same information is printed multiple times, so take a closer look at this part of the code:

    for tag in tags:
        tag = tags[17]
        link = tag.get('href', None)
        print(link)
    

    Because you always pick the same position, tags[17], there is no need to loop over all the tags. The loop prints the same link once for every anchor tag on the page. Simply drop the loop:

    tag = tags[17]
    link = tag.get('href', None)
    print(link)
    

    Be aware that this fixes only your specific issue; it does not cover the main task or the intended solution path.

  2. For me, the problem comes from your second loop over the tags. You don’t need it: as written, it prints the URL as many times as there are <a> tags on the page.

    import urllib.request, urllib.parse, urllib.error
    import collections
    collections.Callable = collections.abc.Callable
    from bs4 import BeautifulSoup
    import ssl
    
    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    url = input('Enter - ')
    for i in range(7):
        html = urllib.request.urlopen(url, context=ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        tags = soup('a')
        tag = tags[17]
        link = tag.get('href', None)
        print(link)
        url = link
    

    This code should print only the 7 URLs that you pass through.
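
For anyone who wants the position and the repeat count as parameters instead of the hard-coded 17 and 7, here is a minimal sketch of the same idea. It is not the course's reference solution. The names AnchorCollector, extract_hrefs, fetch and follow_links are made up for illustration, and it swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. It also leaves out the SSL context from the code above, on the assumption that the pages are served over plain http.

```python
from html.parser import HTMLParser
import urllib.request


class AnchorCollector(HTMLParser):
    """Record the href of every <a> tag in document order (None if absent)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))


def extract_hrefs(html):
    """Return the href values of all anchor tags found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.hrefs


def fetch(url):
    """Download a page and decode it as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


def follow_links(start_url, position, count, fetch_page=fetch):
    """Follow the link at a 1-based position, count times.

    Returns the list of URLs that were followed, in order.
    fetch_page is a parameter so the logic can be tried on canned pages.
    """
    url = start_url
    visited = []
    for _ in range(count):
        hrefs = extract_hrefs(fetch_page(url))
        if position > len(hrefs) or hrefs[position - 1] is None:
            raise ValueError('no usable link at position %d on %s' % (position, url))
        url = hrefs[position - 1]  # position 1 is the first link
        visited.append(url)
    return visited
```

With the values from the question, the call would be follow_links('http://py4e-data.dr-chuck.net/known_by_Armen.html', 18, 7), and the last element of the returned list is the final page. Each page is parsed once and indexed once, so every URL appears a single time.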
