Html - I continue to get empty [ ] while scraping for 'href' in an <a> tag in Python

KIZMAN
February 6, 2024
161 views
0 votes
2 Answers

I am trying to get a link to a page from a football website. I have pulled out the <a> tags, but when I try to get the 'href', the result comes out empty.

I have tried different approaches after going through Stack Overflow.

from bs4 import BeautifulSoup

with open("premier_league/page_2.html", encoding="utf-8") as f:
    page = f.read()
parsed_page = BeautifulSoup(page, "html.parser")
links = parsed_page.find_all("a")

The code above returns all the <a> tags in the page, but going further to pull out the "href" is not working.

I initially tried:

links = [ l.get('href') for l in links]

links = [l for l in links if l and "/all_comps/shooting" in l]

I get this error:

ResultSet object has no attribute 'get'. The .get function is a str property.

I changed the get call to:

links = [l['href'] for l in links]

This throws an error saying that [ ] can't be used on a str.

The next thing I tried was:
for links in parsed_page("a", href = True):
    try:
        print (links['href'])
    except KeyError:
        pass

This worked. The results:

/en/squads/b8fd03ef/2022-2023/c602/Manchester-City-Stats-FA-Community-Shield
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/schedule/Manchester-City-Scores-and-Fixtures-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/keeper/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing_types/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/gca/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/defense/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/possession/Manchester-City-Match-Logs-All-

I needed to turn the above code into a list comprehension, but for some reason it didn’t return anything:

href_list = [link['href'] for link in parsed_page("a", href=True) if 'href' in link]

href_list

[ ] - empty list

I don’t understand. Why won’t it work?

I also tried another approach, because the link I want to get looks like this:

/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions

So I tried:

for link in links:
    if "matchlogs/all_comps/shooting/" in link:
        link
    else:
        pass

Still empty.

What am I doing wrong? I have used both .get and links['href'] to try to get the href attribute, but to no avail.

Answers

- LukePoirrier
- February 6, 2024 at 2:15 pm
- 0 votes
You’re super close, but you just need to narrow down your search. Right now you’re scraping the entirety of each of the links, and you end up with a few layers of XML (I believe; it’s been 10 years since I’ve touched BeautifulSoup).

You just need to specify further by saying that you’re looking specifically for the href attribute inside of all of the a elements. I found another question that was pretty similar to yours that’s been answered more thoroughly, but essentially here’s the gist of it:

You can use find_all in the following way to find every a element that has an href attribute, and print each one:
```
# Python 2 (BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

soup = BeautifulSoup(html)

# BeautifulSoup 3 spells this method findAll; find_all is the bs4 name
for a in soup.findAll('a', href=True):
    print "Found the URL:", a['href']

# The output would be:
# Found the URL: some_url
# Found the URL: another_url

# Python3
from bs4 import BeautifulSoup

html = '''<a href="https://some_url.com">next</a>
<span class="class">
<a href="https://some_other_url.com">another_url</a></span>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com
```
I’m not sure which version of Python or BeautifulSoup you’re using, but you should pay attention to that, because some small things changed in the newer versions that could hold you up if you don’t realize it.
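
For the saved page in the question, the same pattern can be collected into a list instead of printed. The sketch below is only an illustration, assuming bs4 on Python 3: the inline HTML stands in for the contents of premier_league/page_2.html, and the names all_hrefs and shooting_links are placeholders that do not come from the original code.

```python
from bs4 import BeautifulSoup

# Stand-in for the saved page; the real file will contain many more links.
html = """
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions">Shooting</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/keeper/Manchester-City-Match-Logs-All-Competitions">Goalkeeping</a>
<a name="top">anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Filter on the href string itself, not on the tag object.
shooting_links = [href for href in all_hrefs if "/all_comps/shooting" in href]

print(shooting_links)
```

Keeping the list of tags and the list of href strings under different names also avoids overwriting links with strings and then indexing those strings with ['href'] later on.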

Related question: BeautifulSoup getting href [duplicate]

- HedgeHog
- February 6, 2024 at 3:49 pm
- 0 votes
I needed to turn the above code into a list comprehension, but for some reason it didn’t return anything:
```
href_list = [link['href'] for link in parsed_page("a", href=True) if 'href' in link]
```
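This does not work as expected because 'href' in link tests the string against the tag object itself, not against its attributes. On a bs4 Tag, in looks through the tag's children, so the condition is False for every link and the list ends up empty. Try the following instead to get your list of links: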
```
[link.get('href') for link in parsed_page("a", href=True)]
```
Or check against the value of each href:
```
[link.get('href') for link in parsed_page("a", href=True) if 'matchlogs/all_comps/shooting' in link.get('href')]
```
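
To make the difference concrete, here is a small sketch of what the membership test does on a tag, assuming bs4 on Python 3. The one-line HTML snippet and its shortened href are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up single link, only for demonstrating the membership test.
soup = BeautifulSoup('<a href="/matchlogs/all_comps/shooting/x">Shooting</a>', "html.parser")
link = soup.find("a")

# `in` on a Tag looks through the tag's children, not its attributes.
print("href" in link)        # False: no child of the tag equals "href"
print("Shooting" in link)    # True: the text node is a child of the tag

# Ways to test for the attribute itself.
print(link.has_attr("href"))  # True
print("href" in link.attrs)   # True

# Substring test against the attribute value, which is a plain string.
print("all_comps/shooting" in link["href"])  # True
```

The same applies to the loop at the end of the question: "matchlogs/all_comps/shooting/" in link compares against the tag's children when link is a tag, so the test has to be made against link['href'] or link.get('href') instead.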
You could also search directly for the string that the href value should contain, using a CSS selector:
```
[link.get('href') for link in parsed_page.select('a[href*="matchlogs/all_comps/shooting"]')]
```
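
As a quick offline check, the selector version and the filtered comprehension should give the same list. This is a minimal sketch, assuming bs4 on Python 3 with its CSS selector support available; the inline HTML is a stand-in for the saved page, and selected and filtered are illustrative names.

```python
from bs4 import BeautifulSoup

# Stand-in for the saved page from the question.
html = """
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions">Shooting</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing/Manchester-City-Match-Logs-All-Competitions">Passing</a>
<a>no href here</a>
"""
parsed_page = BeautifulSoup(html, "html.parser")

needle = "matchlogs/all_comps/shooting"

# [href*="..."] matches <a> tags whose href contains the given substring.
selected = [a.get("href") for a in parsed_page.select(f'a[href*="{needle}"]')]

# Equivalent filter written as a comprehension over tags that have an href.
filtered = [a.get("href") for a in parsed_page("a", href=True) if needle in a.get("href")]

print(selected)
print(selected == filtered)
```

The selector form also skips tags without an href on its own, so no separate href=True guard is needed there.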
Example

I don't know exactly which page you are starting from, so I used the following.
```
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

[link.get('href') for link in soup.select('a[href*="matchlogs/all_comps/shooting"]')]
```