Html - Python/BeautifulSoup: How to remove tags from the elements?

user23955
June 15, 2024
110 views
3 votes
3 Answers

I have an html source like this:

`<tbody><tr ><th ...><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td class="center " data-stat="position" csk="2.0" >DF</td><td class="left " data-stat="team" ><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center " data-stat="age" >23</td>`

My code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

res = requests.get(f"https://fbref.com/en/comps/9/stats/Premier-League-Stats")
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("",res.text),'lxml')
all_data = soup.findAll("tbody")
player_data = all_data[2]

data = []

for row in player_data.find_all("tr"):
    name = row.find("a")
    team = row.find("td", {"data-stat": "team"})
    age = row.find("td", {"data-stat": "age"})

    data.append([name, team, age])

df = pd.DataFrame(data, columns=['Name', 'Team', 'Age'])

When I print df, it gives me the desired outcome:

[Max Aarons] [[Bournemouth]] [23]

But when I export the output to excel using to_excel, I get this:

<a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a>
<td class="left" data-stat="team">Bournemouth</td>
<td class="center" data-stat="age">23</td>

But I don’t want the tags or links in the spreadsheet, only the text.

I tried get_text(), but got an "object has no attribute 'get_text'" error.

Answers

- mcode11
- June 15, 2024 at 3:33 pm
- 0 votes
Instead of calling get_text(), you can convert the element to a string with str(elem).
That gives you something like <a href="somewhere">Some Name</a>.
To strip the tags from that string, you can do:
```
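elemtext = str(elem)  # elem is a bs4 Tag, e.g. the result of row.find("a")
prefix = elemtext.split('>')[0] + '>'  # the opening tag, e.g. <a href="somewhere">
inner = elemtext.replace(prefix, '').split('<')[0]  # text up to the next '<'
print(inner)  # for testing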
```
inner will then hold the name.
If inner contains HTML entities, import the html module and call html.unescape(inner).
Its return value is the unescaped name.
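
To see where this string-splitting approach works and where it falls short, here is a minimal sketch that uses only the standard library. The function name inner_text and the sample strings are made up for illustration, and str.partition is used in place of split/replace, but the idea is the same: drop the opening tag, then keep the text up to the next '<'.

```python
import html


def inner_text(markup: str) -> str:
    """Return the text between the first opening tag and the next '<'."""
    # Drop everything up to and including the first '>' (the opening tag).
    _, _, rest = markup.partition(">")
    # Keep only what comes before the next '<' (a closing or nested tag).
    text, _, _ = rest.partition("<")
    # Decode entities such as &amp; and trim surrounding whitespace.
    return html.unescape(text).strip()


# A simple link: the text sits directly inside the tag.
print(repr(inner_text('<a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a>')))

# A cell that wraps a link: the text is one level deeper, so nothing is found.
print(repr(inner_text('<td data-stat="team"><a href="/x">Bournemouth</a></td>')))
```

The first call prints 'Max Aarons', but the second prints an empty string because the cell's text is nested inside another tag. For cells like the team column, the .text attribute or get_text() of the Tag is likely the more robust choice.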

You can use the .text attribute of a bs4 Tag to get only its text. Here is your code with that change:

[...]

for row in player_data.find_all("tr"):
    name = row.find("a")
    team = row.find("td", {"data-stat": "team"})
    age = row.find("td", {"data-stat": "age"})
    if name is None:
        continue  # some rows contain no player link (e.g. blank rows), so skip them
    name = name.text
    team = team.text
    age = int(age.text)

    data.append([name, team, age])

df = pd.DataFrame(data, columns=['Name', 'Team', 'Age'])
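
The "no attribute 'get_text'" error in the question most likely comes from rows where find() matched nothing and returned None. Here is a small self-contained sketch of that case. SAMPLE is a made-up table loosely based on the snippet in the question, text_of is a hypothetical helper, and "html.parser" is used instead of "lxml" only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Made-up sample: one header row without links, one player row.
SAMPLE = """
<table><tbody>
<tr class="thead"><th>Player</th><th>Squad</th><th>Age</th></tr>
<tr><td data-stat="player"><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td>
<td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
<td data-stat="age">23</td></tr>
</tbody></table>
"""


def text_of(tag):
    """Return the stripped text of a Tag, or None when find() matched nothing."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = []
for row in soup.find_all("tr"):
    name = text_of(row.find("a"))
    team = text_of(row.find("td", {"data-stat": "team"}))
    age = text_of(row.find("td", {"data-stat": "age"}))
    if name is None:
        continue  # the header row has no <a>, so find() returned None
    rows.append((name, team, age))

print(rows)
```

Because rows now holds plain strings rather than Tag objects, a DataFrame built from it should export to Excel without any markup.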

Not every <tr> contains an <a> element, so you should add some checks. You can then read the .text attribute of the elements that were actually found.

Also, you don’t need to build a list, because the first argument to DataFrame can be any iterable, including a generator.

Therefore:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

URL = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
TEAM = {"data-stat": "team"}
AGE = {"data-stat": "age"}


def gendata(tbody):
    for row in tbody.find_all("tr"):
        if name := row.find("a"):
            if team := row.find("td", TEAM):
                if age := row.find("td", AGE):
                    yield name.text, team.text, age.text


def nocomment(html):
    return re.sub("<!--|-->", "", html)


with requests.get(URL) as response:
    response.raise_for_status()
    soup = BeautifulSoup(nocomment(response.text), "lxml")
    if len(all_data := soup.select("tbody")) > 2:
        df = pd.DataFrame(gendata(all_data[2]), columns=["Name", "Team", "Age"])
        df["Age"] = df["Age"].astype(int)
        df.to_excel("football.xlsx", index=False)
    else:
        print("Unexpected HTML structure")
