I am trying to get a link from a page on a football website. I have pulled out the <a> tags,
but when I try to get the 'href',
it comes out empty.
I have tried several approaches after going through Stack Overflow.
from bs4 import BeautifulSoup

with open("premier_league/page_2.html", encoding="utf-8") as f:
    page = f.read()

# Parse the file contents and collect every <a> tag
parsed_page = BeautifulSoup(page, "html.parser")
links = parsed_page.find_all("a")
The code above returns all the <a> tags in the page, but going further and pulling out the "href"
is not working.
I initially tried:
links = [l.get('href') for l in links]
links = [l for l in links if l and "/all_comps/shooting" in l]
I get the error:
ResultSet object has no attribute 'get'
I thought the .get function was a str method.
I then changed the .get call to:
links = [l['href'] for l in links]
This throws an error saying that [ ] can't be used on a str.
The next thing I tried was:
for links in parsed_page("a", href=True):
    try:
        print(links['href'])
    except KeyError:
        pass
This worked. The results:
/en/squads/b8fd03ef/2022-2023/c602/Manchester-City-Stats-FA-Community-Shield
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/schedule/Manchester-City-Scores-and-Fixtures-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/keeper/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing_types/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/gca/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/defense/Manchester-City-Match-Logs-All-Competitions
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/possession/Manchester-City-Match-Logs-All-
I wanted to turn the code above into a list comprehension, but for some reason it returned nothing:
href_list = [link['href'] for link in parsed_page("a", href=True) if 'href' in link]
href_list
[] (an empty list)
I don’t understand. Why won’t it work?
I also tried another approach, because the link I want to get looks like
/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions
So:
for link in links:
    if "matchlogs/all_comps/shooting/" in link:
        link
    else:
        pass
Still empty.
What am I doing wrong? I have used both .get and links['href'] to try to get the href attribute, but to no avail.
2 Answers
You’re super close, but you just need to narrow down your search. Right now you’re scraping the entirety of each link, and you end up with a few layers of XML (I believe; it’s been 10 years since I’ve touched BeautifulSoup).
You just need to be more specific and say that you’re looking for the href attribute inside all of the a elements. I found another question, pretty similar to yours, that’s been answered more thoroughly, but essentially here’s the gist of it:
You can use find_all in the following way to find every a element that has an href attribute, and print each one:
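A minimal sketch of that approach, assuming a short inline HTML string stands in for the saved page (the sample markup and the names html, soup and hrefs are illustrative, not taken from the question):

```python
from bs4 import BeautifulSoup

# Illustrative markup standing in for the saved page (an assumption).
html = """
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions">Shooting</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/keeper/Manchester-City-Match-Logs-All-Competitions">Keeper</a>
<a name="no-href">Anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> elements that actually carry an href attribute.
hrefs = []
for a in soup.find_all("a", href=True):
    hrefs.append(a["href"])
    print(a["href"])
```

The third anchor has no href, so it is skipped and no KeyError handling is needed.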
I’m not sure which version of Python or BeautifulSoup you’re using, but keep it in mind, because some small things changed in the newer versions and they could hold you up if you don’t realize it.
Related question: BeautifulSoup getting href [duplicate]
"I needed to change the above code to list comprehension but some reasons it didn’t respond:"
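This does not work as expected because the condition 'href' in link checks the Tag object itself, not its attributes, against a string. Since href=True already keeps only the tags that have an href, you can drop the condition to get your list of links: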
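A small sketch of the difference, assuming an inline HTML string in place of the saved page (the markup and the names first, attr_in_tag and has_attr are illustrative). The in operator on a Tag looks through the tag’s children, so it is False here even though the attribute exists:

```python
from bs4 import BeautifulSoup

# Illustrative markup standing in for the saved page (an assumption).
html = """
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions">Shooting</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/passing/Manchester-City-Match-Logs-All-Competitions">Passing</a>
"""

parsed_page = BeautifulSoup(html, "html.parser")
first = parsed_page.find("a", href=True)

# `in` on a Tag tests its children (here the text "Shooting"), not its attributes.
attr_in_tag = "href" in first
# has_attr is the attribute test the original condition was probably meant to be.
has_attr = first.has_attr("href")

# With href=True doing the filtering, no extra condition is needed.
href_list = [link["href"] for link in parsed_page("a", href=True)]
print(attr_in_tag, has_attr, len(href_list))
```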
Or check against the value of each href:
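As a sketch, assuming the same kind of inline sample markup as a stand-in for the real page (the names html, parsed_page and shooting_links are illustrative), the substring test is applied to the href string rather than to the tag:

```python
from bs4 import BeautifulSoup

# Illustrative markup standing in for the saved page (an assumption).
html = """
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/schedule/Manchester-City-Scores-and-Fixtures-All-Competitions">Fixtures</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions">Shooting</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/keeper/Manchester-City-Match-Logs-All-Competitions">Keeper</a>
"""

parsed_page = BeautifulSoup(html, "html.parser")

# a["href"] is a plain str, so `in` performs an ordinary substring check.
shooting_links = [
    a["href"]
    for a in parsed_page.find_all("a", href=True)
    if "/all_comps/shooting/" in a["href"]
]
print(shooting_links)
```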
You could also search directly for the string that should be contained in the href value, using CSS selectors.
Example
I don’t know exactly which page you are starting from, so I used the following.
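A minimal sketch of the CSS selector approach, assuming a short inline HTML string rather than a fetched page (the markup and the names html, soup and shooting_links are illustrative). The attribute selector [href*="..."] matches elements whose href contains the given substring:

```python
from bs4 import BeautifulSoup

# Illustrative markup standing in for the real page (an assumption).
html = """
<a href="/en/squads/b8fd03ef/2022-2023/c602/Manchester-City-Stats-FA-Community-Shield">Community Shield</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions">Shooting</a>
<a href="/en/squads/b8fd03ef/2022-2023/matchlogs/all_comps/gca/Manchester-City-Match-Logs-All-Competitions">GCA</a>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; *= means "attribute value contains".
shooting_links = [
    a["href"] for a in soup.select('a[href*="matchlogs/all_comps/shooting/"]')
]
print(shooting_links)
```

The selector does the filtering in one step, so no separate check on the href value is needed afterwards.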