
I am trying to get the link to a page from a page on a football website. I have pulled out the <a> tags, but when I try to get the 'href', it comes out empty.

I have tried several approaches after searching Stack Overflow.

from bs4 import BeautifulSoup

with open("premier_league/page_2.html", encoding="utf-8") as f:
    page = f.read()
parsed_page = BeautifulSoup(page, "html.parser")  # parse the text that was just read
links = parsed_page.find_all("a")

The code above returns all the <a> tags in the page, but the next step, pulling out the "href", is not working.

I initially tried:

links = [ l.get('href') for l in links]
links = [l for l in links if l and "/all_comps/shooting" in l]

I get the error:

ResultSet object has no attribute 'get'. The .get function is a str property.

I changed the get function to:

links = [l['href'] for l in links]

This throws an error saying that [ ] can't be used on a str.

The next thing I tried was:

for links in parsed_page("a", href=True):
    try:
        print(links['href'])
    except KeyError:
        pass

This worked. The results:

/en/squads/b8fd03ef/2022-2023/c602/Manchester-City-Stats-FA-Community-Shield
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/schedule/Manchester-City-Scores-and-Fixtures-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/keeper/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing_types/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/gca/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/defense/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/possession/Manchester-City-Match-Logs-All-

I needed to convert the code above to a list comprehension, but for some reason it returned nothing:

href_list = [link['href'] for link in parsed_page("a", href=True) if 'href' in link]
href_list
[ ] - empty list

I don’t understand. Why won’t it work?

I also tried another approach, because the link I want looks like this:

/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions

so:

for link in links:
    if "matchlogs/all_comps/shooting/" in link:
        link
    else:
        pass

Still empty.

What am I doing wrong? I have tried both .get and links['href'] to get the href attribute, but to no avail.

2 Answers


  1. You’re super close; you just need to narrow down your search. Right now you’re scraping each link element in its entirety, so you end up with a few layers of XML (I believe; it’s been 10 years since I’ve touched BeautifulSoup).

    You just need to specify further by saying that you’re looking specifically for the href attribute inside of all of the a elements. I found another question that was pretty similar to yours that’s been answered more thoroughly, but essentially here’s the gist of it:

    You can use find_all in the following way to find every a element that has an href attribute, and print each one:

    # Python 2 (BeautifulSoup 3, where the method is named findAll rather than find_all)
    from BeautifulSoup import BeautifulSoup

    html = '''<a href="some_url">next</a>
    <span class="class"><a href="another_url">later</a></span>'''

    soup = BeautifulSoup(html)

    for a in soup.findAll('a', href=True):
        print "Found the URL:", a['href']

    # The output would be:
    # Found the URL: some_url
    # Found the URL: another_url
    
    # Python3
    from bs4 import BeautifulSoup
    
    html = '''<a href="https://some_url.com">next</a>
    <span class="class">
    <a href="https://some_other_url.com">another_url</a></span>'''
    
    soup = BeautifulSoup(html, "html.parser")
    
    for a in soup.find_all('a', href=True):
        print("Found the URL:", a['href'])
    
    # The output would be:
    # Found the URL: https://some_url.com
    # Found the URL: https://some_other_url.com
    

    I’m not sure which version of Python or BeautifulSoup you’re using, but you should pay attention to that, because some small things changed in the newer versions that could keep you held up if you don’t realize it.

    BeautifulSoup getting href [duplicate]
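
To make the difference between the lookups discussed above concrete, here is a minimal sketch. It assumes bs4 (BeautifulSoup 4) is installed, and the two-anchor HTML snippet is made up for illustration, not taken from the page in the question. It shows that `tag.get('href')` gives `None` for an anchor without an href, `tag['href']` raises `KeyError`, and `href=True` filters such anchors out up front.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one anchor with an href, one without.
html = '<a href="/one">one</a><a name="anchor-only">two</a>'
soup = BeautifulSoup(html, "html.parser")

all_anchors = soup.find_all("a")             # every <a>, with or without href
with_href = soup.find_all("a", href=True)    # only <a> tags that have an href

# .get() returns None when the attribute is missing.
hrefs_via_get = [a.get("href") for a in all_anchors]

# Indexing is safe here because href=True already filtered the tags.
hrefs_direct = [a["href"] for a in with_href]

# Indexing a tag that lacks the attribute raises KeyError.
try:
    all_anchors[1]["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(hrefs_via_get)
print(hrefs_direct)
print(raised_key_error)
```

If the list built with `.get()` is used, the `None` entries need to be dropped before any substring test, which is what the `if l and ...` guard in the question was doing.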

  2. I needed to convert the code above to a list comprehension, but for some reason it returned nothing:

    href_list = [link['href'] for link in parsed_page("a", href=True) if 'href' in link]
    

    This does not give the expected result because 'href' in link checks the tag object itself against a string; for a tag, the in operator looks through its children, not its attributes. You could try the following to get your list of links instead:

    [link.get('href') for link in parsed_page("a", href=True)]
    

    Or check against the value of each href:

    [link.get('href') for link in parsed_page("a", href=True) if 'matchlogs/all_comps/shooting' in link.get('href')]
    

    You could also search directly for the string that the href value should contain, using a CSS selector:

    [link.get('href') for link in soup.select('a[href*="matchlogs/all_comps/shooting"]')]
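
The three variants above can be compared side by side. This is only a sketch under stated assumptions: bs4 with CSS selector support is installed, and the HTML and its paths are invented stand-ins for the real page. It also shows why the `'href' in link` condition filtered everything out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the paths are illustrative only.
html = """
<a href="/en/squads/x/matchlogs/all_comps/shooting/Team-Match-Logs">Shooting</a>
<a href="/en/squads/x/matchlogs/all_comps/passing/Team-Match-Logs">Passing</a>
<a>no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a")

in_children = "href" in first        # looks at the tag's children (its text node)
in_attrs = "href" in first.attrs     # looks at the attribute dictionary
has_attr = first.has_attr("href")    # explicit attribute check

wanted = "matchlogs/all_comps/shooting"

# Filter in Python on the attribute value.
by_filter = [a["href"] for a in soup.find_all("a", href=True) if wanted in a["href"]]

# Let the CSS "contains" selector do the filtering.
by_select = [a["href"] for a in soup.select('a[href*="matchlogs/all_comps/shooting"]')]

print(in_children, in_attrs, has_attr)
print(by_filter == by_select)
```

Both approaches should select the same anchors; the selector version simply moves the substring test into the query.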
    
    Example

    I don’t know exactly which page you are starting from, so I used the following:

    import requests
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(requests.get('https://fbref.com/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions').text, 'html.parser')
    
    [link.get('href') for link in soup.select('a[href*="matchlogs/all_comps/shooting"]')]
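
One possible contributor to the "still empty" result in the question, offered as a guess and not something the question confirms: the loop `for links in parsed_page("a", href=True)` reuses the name `links`, so after it finishes `links` is a single tag, not the original result set. The sketch below uses made-up HTML to show what a later `for link in links` would then iterate over.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link page, for illustration only.
html = '<a href="/matchlogs/all_comps/shooting/Team">Shooting</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")            # a ResultSet of Tag objects
for links in soup("a", href=True):    # rebinds `links` to each Tag in turn
    pass

# `links` is now the last Tag. Iterating a Tag yields its children
# (here a single text node), not anchor tags.
rebound_type = type(links).__name__
children = [str(child) for child in links]
matches = [c for c in links if "matchlogs/all_comps/shooting/" in c]

# A separate loop variable keeps the result set intact.
anchors = soup.find_all("a", href=True)
hrefs = [a["href"] for a in anchors if "matchlogs/all_comps/shooting/" in a["href"]]

print(rebound_type, children, matches)
print(hrefs)
```

If this is what happened, re-running `links = parsed_page.find_all("a")` before the comprehension, or using a different loop variable name, would be worth trying.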
    