I have an HTML source like this:
`<tbody><tr ><th ...><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td class="center " data-stat="position" csk="2.0" >DF</td><td class="left " data-stat="team" ><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center " data-stat="age" >23</td>`
My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
res = requests.get(f"https://fbref.com/en/comps/9/stats/Premier-League-Stats")
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("",res.text),'lxml')
all_data = soup.findAll("tbody")
player_data = all_data[2]
data = []
for row in player_data.find_all("tr"):
    name = row.find("a")
    team = row.find("td", {"data-stat": "team"})
    age = row.find("td", {"data-stat": "age"})
    data.append([name, team, age])
df = pd.DataFrame(data, columns=['Name', 'Team', 'Age'])
When I print `df`, it gives me the desired outcome:
[Max Aarons] [[Bournemouth]] [23]
But when I export the output to Excel using `to_excel`, I get this:
<a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a>
<td class="left" data-stat="team">Bournemouth</td>
<td class="center" data-stat="age">23</td>
But I don’t want the tags or permalinks, only the text.
I tried `get_text()` but got an "object has no attribute 'get_text'" error.
3 Answers
Instead of the `get_text()` method, you can wrap the element in `str(elem)`. After that you’ll have something like this: `<a href="somewhere">Some Name</a>`. To remove the surrounding tag, you can do:
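One way to do that, as a minimal sketch: the helper name `strip_tags` is hypothetical, the regular expression assumes the string is a single element like the one shown above, and this may not be the exact snippet the answer originally used.

```python
import re

def strip_tags(markup: str) -> str:
    """Return whatever sits between the outermost opening and closing tag.

    If the string does not look like a single element, it is returned unchanged.
    """
    match = re.fullmatch(r"\s*<[^>]+>(.*)</[^>]+>\s*", markup, flags=re.DOTALL)
    return match.group(1) if match else markup

# In the question's loop, this string would come from str(elem).
inner = strip_tags('<a href="somewhere">Some Name</a>')
print(inner)
```

Note that this only peels off the outermost tag, so a `<td>` that wraps an `<a>` would still leave the inner `<a>` markup behind.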
`inner` will be the name. If `inner` is HTML-escaped, import the `html` module and call `html.unescape(inner)`. The return value will be the name.
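As a small illustration of the unescaping step (the sample string here is made up, not taken from the page):

```python
import html

# A made-up example of text that still contains an HTML entity.
inner = "Brighton &amp; Hove Albion"

# html.unescape converts entities such as &amp; back to plain characters.
name = html.unescape(inner)
print(name)
```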
You could use the `.text` attribute of a bs4 `Tag` to get only the text part. Here is your code, with this approach.

There is not an `<a>` element in every `<tr>` that you observe, so `row.find("a")` sometimes returns `None`, which is why `get_text()` failed. You should add some checks. You can then use the `.text` attribute of the valid elements that you discover.
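A minimal sketch of such a check, run against a made-up two-row table rather than the live page; the helper name `safe_text` is hypothetical, and `html.parser` is used only so the example does not depend on `lxml`:

```python
from bs4 import BeautifulSoup

# Made-up sample: a header row without an <a>, then one player row.
sample = """
<table><tbody>
<tr><th>Player</th><th>Squad</th><th>Age</th></tr>
<tr><th><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></th>
<td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
<td data-stat="age">23</td></tr>
</tbody></table>
"""

def safe_text(tag):
    """Return the stripped text of a Tag, or None when find() found nothing."""
    return tag.text.strip() if tag is not None else None

soup = BeautifulSoup(sample, "html.parser")
names = [safe_text(row.find("a")) for row in soup.find_all("tr")]
print(names)  # the header row yields None instead of raising AttributeError
```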
Also, you don’t need to build a list, because the first argument to `DataFrame` can be any iterable, including a generator.
Therefore:
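A sketch of how that could look, again on a made-up sample instead of the live page; the generator name `player_rows` is hypothetical, and rows missing any of the three cells are simply skipped:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample: a header row without player cells, then one player row.
sample = """
<table><tbody>
<tr><th>Player</th><th>Squad</th><th>Age</th></tr>
<tr><th><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></th>
<td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
<td data-stat="age">23</td></tr>
</tbody></table>
"""

def player_rows(tbody):
    """Yield [name, team, age] as plain strings for each complete row."""
    for row in tbody.find_all("tr"):
        cells = (
            row.find("a"),
            row.find("td", {"data-stat": "team"}),
            row.find("td", {"data-stat": "age"}),
        )
        # Skip rows (such as repeated headers) where any cell is missing.
        if all(cell is not None for cell in cells):
            yield [cell.text.strip() for cell in cells]

soup = BeautifulSoup(sample, "html.parser")
# The generator is passed straight to DataFrame; no intermediate list is built by hand.
df = pd.DataFrame(player_rows(soup.find("tbody")), columns=["Name", "Team", "Age"])
print(df)
```

Because the cells now hold strings rather than `Tag` objects, `df.to_excel(...)` should write the text only, without the markup.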