
I have an HTML source like this:

`<tbody><tr ><th ...><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td class="center " data-stat="position" csk="2.0" >DF</td><td class="left " data-stat="team" ><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center " data-stat="age" >23</td>`

My code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

res = requests.get(f"https://fbref.com/en/comps/9/stats/Premier-League-Stats")
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("",res.text),'lxml')
all_data = soup.findAll("tbody")
player_data = all_data[2]

data = []

for row in player_data.find_all("tr"):
    name = row.find("a")
    team = row.find("td", {"data-stat": "team"})
    age = row.find("td", {"data-stat": "age"})

    data.append([name, team, age])

df = pd.DataFrame(data, columns=['Name', 'Team', 'Age'])

When I print df, it gives me the desired outcome:

[Max Aarons] [[Bournemouth]] [23]

But when I export the DataFrame to Excel using to_excel, I get this:

<a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a>
<td class="left" data-stat="team">Bournemouth</td>
<td class="center" data-stat="age">23</td>

I don't want the tags or links in the output, only the text.

I tried get_text(), but got an "object has no attribute 'get_text'" error.

3 Answers


  1. Instead of calling get_text(), you can convert the element to a string with str(elem).
    That gives you something like <a href="somewhere">Some Name</a>.
    To strip the tags from that string, you can do:

    elemtext = str(elem)
    prefix = elemtext.split('>')[0] + '>'
    inner = elemtext.replace(prefix, '').split('<')[0]
    print(inner) # for test
    

    inner will be the name.
    If inner is HTML-escaped, import the html module and call html.unescape(inner).
    The return value will be the unescaped name.
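
The string-splitting approach above can be tried without any network access. The sketch below wraps it in a helper, `inner_text_by_split` (a hypothetical name, not part of any library), and runs it on three sample strings. It assumes the markup has no `>` inside attribute values. Note the second case: when the text sits inside a nested tag, as in the team cell from the question, splitting on the next `<` returns an empty string, so this approach only suits elements that contain plain text directly.

```python
import html


def inner_text_by_split(elemtext: str) -> str:
    """Return the text between the first opening tag and the next '<'."""
    # Everything up to and including the first '>' is the opening tag.
    prefix = elemtext.split(">")[0] + ">"
    # Drop the opening tag once, then keep what precedes the next tag.
    inner = elemtext.replace(prefix, "", 1).split("<")[0]
    # Turn entities such as &amp; back into plain characters.
    return html.unescape(inner)


flat = '<a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a>'
nested = (
    '<td class="left " data-stat="team">'
    '<a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>'
)
escaped = '<td data-stat="team">Brighton &amp; Hove Albion</td>'  # made-up sample cell

print(inner_text_by_split(flat))          # Max Aarons
print(repr(inner_text_by_split(nested)))  # '' (text is inside a nested <a>)
print(inner_text_by_split(escaped))       # Brighton & Hove Albion
```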

  2. You could use the .text attribute of a bs4 Tag to get only its text. Here is your code with that change:

    [...]
    
    for row in player_data.find_all("tr"):
        name = row.find("a")
        team = row.find("td", {"data-stat": "team"})
        age = row.find("td", {"data-stat": "age"})
        if name is None:
            continue  # Some rows have no player link; skip them
        name = name.text
        team = team.text
        age = int(age.text)
    
    
        data.append([name, team, age])
    
    df = pd.DataFrame(data, columns=['Name', 'Team', 'Age'])
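
To see why the `None` check matters, here is a small offline sketch. The `SAMPLE` markup is a hand-written stand-in for the real page (an assumption, not a copy of it): one row without an `<a>` and one player row. It may well be that the "no attribute 'get_text'" error in the question came from a row like the first one, where `find("a")` returns `None`, though the question does not show the full traceback. The sketch uses the built-in `html.parser` backend so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written sample: a header-style row with no <a>, then one player row.
SAMPLE = """
<table><tbody>
<tr class="thead"><th data-stat="ranker">Rk</th><th data-stat="player">Player</th></tr>
<tr><td data-stat="player"><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td>
<td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
<td data-stat="age">23</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = soup.find("tbody").find_all("tr")

# The first row has no link, so find() returns None rather than a Tag.
header_link = rows[0].find("a")
try:
    header_link.get_text()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)  # 'NoneType' object has no attribute 'get_text'

# On a row that does have the elements, .text yields plain strings.
player_row = rows[1]
name = player_row.find("a").text
team = player_row.find("td", {"data-stat": "team"}).text
age = int(player_row.find("td", {"data-stat": "age"}).text)
print(name, team, age)  # Max Aarons Bournemouth 23
```

Storing these plain strings and integers in the list, rather than the Tag objects, is what keeps the markup out of the exported file.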
    
  3. Not every <tr> contains an <a> element, so row.find("a") can return None. You should add checks for that, and then use the .text attribute of the elements you do find.

    Also, you don't need to build a list first, because the first parameter to DataFrame can be any iterable, including a generator.

    Therefore:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import re
    
    URL = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
    TEAM = {"data-stat": "team"}
    AGE = {"data-stat": "age"}
    
    
    def gendata(tbody):
        for row in tbody.find_all("tr"):
            if name := row.find("a"):
                if team := row.find("td", TEAM):
                    if age := row.find("td", AGE):
                        yield name.text, team.text, age.text
    
    
    def nocomment(html):
        return re.sub("<!--|-->", "", html)
    
    
    with requests.get(URL) as response:
        response.raise_for_status()
        soup = BeautifulSoup(nocomment(response.text), "lxml")
        if len(all_data := soup.select("tbody")) > 2:
            df = pd.DataFrame(gendata(all_data[2]), columns=["Name", "Team", "Age"])
            df["Age"] = df["Age"].astype(int)
            df.to_excel("football.xlsx", index=False)
        else:
            print("Unexpected HTML structure")
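
The generator approach can also be checked offline. The following is a minimal sketch, not the answer's exact code: `SAMPLE` is hand-written markup, "Example Player" and its URL are made up, and `cell_text` and `gen_rows` are hypothetical helper names. It uses `get_text(strip=True)` to trim whitespace and returns `None` for an age that is not a plain number, on the assumption that some cells may be empty.

```python
from bs4 import BeautifulSoup

# Hand-written sample: a header row, a normal row, and a row with no age.
SAMPLE = """
<table><tbody>
<tr class="thead"><th data-stat="player">Player</th><th data-stat="team">Squad</th></tr>
<tr><td data-stat="player"><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td>
    <td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
    <td data-stat="age"> 23 </td></tr>
<tr><td data-stat="player"><a href="/en/players/00000000/Example-Player">Example Player</a></td>
    <td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
    <td data-stat="age"></td></tr>
</tbody></table>
"""


def cell_text(row, stat):
    """Return the stripped text of the <td> with this data-stat, or None."""
    cell = row.find("td", {"data-stat": stat})
    return cell.get_text(strip=True) if cell is not None else None


def gen_rows(tbody):
    """Yield (name, team, age) for every row that has a player link."""
    for row in tbody.find_all("tr"):
        link = row.find("a")
        if link is None:
            continue  # no <a> in this row, nothing to extract
        age = cell_text(row, "age")
        # Convert only when the cell holds a plain number.
        yield (
            link.get_text(strip=True),
            cell_text(row, "team"),
            int(age) if age and age.isdigit() else None,
        )


soup = BeautifulSoup(SAMPLE, "html.parser")
records = list(gen_rows(soup.find("tbody")))
print(records)
# [('Max Aarons', 'Bournemouth', 23), ('Example Player', 'Bournemouth', None)]
```

The resulting tuples contain only strings, integers, and `None`, so passing them to `pd.DataFrame(records, columns=["Name", "Team", "Age"])` and then calling `to_excel` should write text rather than markup.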
    