
I am working on a web scraping project, and my goal is to scrape basketball-reference.com. I want to extract statistics for the best player on the Miami Heat and on 2 other Miami-based teams, to make something along the lines of digital flash cards for comparing performance. My current work-in-progress code is below. (Its current output comes from the else branch: "Jimmy Butler's statistics not found.")

Before I show the code: I am still learning the ins and outs of web scraping. What is the exact process for properly extracting the desired information from the HTML? I appreciate any and all help!


Current code as of 7/5/2023:

import requests
from bs4 import BeautifulSoup

# URL of the webpage with Jimmy Butler's statistics
url = "https://www.basketball-reference.com/players/b/butleji01.html"

# Send a GET request to the webpage
response = requests.get(url)
html_content = response.content

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find the HTML table that contains the statistics
table = soup.find("table", {"id": "per_game"})

# Extract the table headers (column names)
headers = table.find("thead").find_all("th")
column_names = [header.text for header in headers]

# Find the row that corresponds to Jimmy Butler
rows = table.find("tbody").find_all("tr")
jimmy_butler_row = None
for row in rows:
if row.find("th").text == "Jimmy Butler":
    jimmy_butler_row = row
    break

# Check if the row for Jimmy Butler was found
if jimmy_butler_row is not None:
# Extract the statistics for Jimmy Butler
stats = jimmy_butler_row.find_all("td")
pts = stats[22.9].text
ast = stats[5.3].text
fg_percentage = stats[53.9].text
ft_percentage = stats[85.0].text

# Store the extracted statistics in a data structure
jimmy_butler_stats = {
    "Points (PTS)": pts,
    "Assists (AST)": ast,
    "Field Goal Percentage (FG%)": fg_percentage,
    "Free Throw Percentage (FT%)": ft_percentage
}

# Print the extracted statistics
print("Jimmy Butler's Statistics:")
for stat_name, stat_value in jimmy_butler_stats.items():
    print(stat_name + ":", stat_value)
else:
print("Jimmy Butler's statistics not found.")

2 Answers


  1. You have no indentation after your if statement and none after your else statement, so no code is attached to either branch. Python uses indentation to decide which code block belongs to an if or for statement, since it does not use braces for that like other programming languages do.
    I guess you wanted something like this:

    # Check if the row for Jimmy Butler was found
    if jimmy_butler_row is not None:
        # Extract the statistics for Jimmy Butler
        stats = jimmy_butler_row.find_all("td")
        pts = stats[22.9].text
        ast = stats[5.3].text
        fg_percentage = stats[53.9].text
        ft_percentage = stats[85.0].text
    
        # Store the extracted statistics in a data structure
        jimmy_butler_stats = {
            "Points (PTS)": pts,
            "Assists (AST)": ast,
            "Field Goal Percentage (FG%)": fg_percentage,
            "Free Throw Percentage (FT%)": ft_percentage
        }
    
        # Print the extracted statistics
        print("Jimmy Butler's Statistics:")
        for stat_name, stat_value in jimmy_butler_stats.items():
            print(stat_name + ":", stat_value)
    else:
        print("Jimmy Butler's statistics not found.")
    

    EDIT:
    It seems I missed another indentation error:

    for row in rows:
        if row.find("th").text == "Jimmy Butler":
            jimmy_butler_row = row
    

    This still does not find Jimmy Butler's statistics, since there is no table header cell (th) with the text "Jimmy Butler". If I print out all the th elements, I only get seasons:

    2011-12
    2012-13
    2013-14
    2014-15
    2015-16
    2016-17
    2017-18
    2018-19
    2018-19
    2018-19
    2019-20
    2020-21
    2021-22
    2022-23
    

    Looking at that page, I also cannot see a table header with the text "Jimmy Butler". Are you maybe using the wrong page/URL?
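The check described above (printing the th text of every body row) can be reproduced offline. This is only a minimal sketch: the HTML fragment is a made-up, trimmed-down stand-in for the per_game table on a player page, not the real markup, and row_labels is a name chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a player page's per_game table.
# It is NOT the real page; it only illustrates the row structure.
html = """
<table id="per_game">
  <thead><tr><th>Season</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th data-stat="season">2021-22</th><td data-stat="pts_per_g"></td></tr>
    <tr><th data-stat="season">2022-23</th><td data-stat="pts_per_g">22.9</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "per_game"})

# Collect the text of the first <th> in every body row.
row_labels = []
for row in table.find("tbody").find_all("tr"):
    th = row.find("th")
    if th is not None:  # skip rows that have no <th> at all
        row_labels.append(th.get_text(strip=True))

print(row_labels)
```

On a player page each row is labeled by season, so comparing the th text against a player name can never match.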

  2. There are no <th> tags with the text 'Jimmy Butler', hence that if condition is never True, jimmy_butler_row stays None, and you end up in your else branch. Based on your description, I am going to assume you are trying to pull stats from the team page 'https://www.basketball-reference.com/teams/MIA/2023.html'.

    There are a few other things you need to fix, which I've commented in the code:

    import requests
    from bs4 import BeautifulSoup
    
    # URL of the webpage with Jimmy Butler's statistics
    url = "https://www.basketball-reference.com/teams/MIA/2023.html" # Team stats page
    
    # Send a GET request to the webpage
    response = requests.get(url)
    html_content = response.content
    
    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(html_content, "html.parser") # you stored response.content, so might as well use that variable here instead of response.content
    
    # Find the HTML table that contains the statistics
    table = soup.find("table", {"id": "per_game"})
    
    # Extract the table headers (column names)
    headers = table.find("thead").find_all("th")
    column_names = [header.text for header in headers]
    
    # Find the row that corresponds to Jimmy Butler
    rows = table.find("tbody").find_all("tr")
    jimmy_butler_row = None
    for row in rows:
        if row.find("td").text == "Jimmy Butler":   # the text Jimmy Butler is under a td tag, not a th tag
            jimmy_butler_row = row
            break
    
    # Check if the row for Jimmy Butler was found
    if jimmy_butler_row is not None:
        # Extract the statistics for Jimmy Butler
        pts = jimmy_butler_row.find('td', {'data-stat': 'pts_per_g'}).text     # Here's how to grab those stats: look each cell up by its tag and data-stat attribute
        ast = jimmy_butler_row.find('td', {'data-stat': 'ast_per_g'}).text
        fg_percentage = jimmy_butler_row.find('td', {'data-stat': 'fg_pct'}).text
        ft_percentage = jimmy_butler_row.find('td', {'data-stat': 'ft_pct'}).text
        
        # Store the extracted statistics in a data structure
        jimmy_butler_stats = {
            "Points (PTS)": pts,
            "Assists (AST)": ast,
            "Field Goal Percentage (FG%)": fg_percentage,
            "Free Throw Percentage (FT%)": ft_percentage
        }
        
        # Print the extracted statistics
        print("Jimmy Butler's Statistics:")
        for stat_name, stat_value in jimmy_butler_stats.items():
            print(stat_name + ":", stat_value)
    else:
        print("Jimmy Butler's statistics not found.")
    

    Output:

    Jimmy Butler's Statistics:
    Points (PTS): 22.9
    Assists (AST): 5.3
    Field Goal Percentage (FG%): .539
    Free Throw Percentage (FT%): .850
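One thing to watch for with the lookup above: row.find("td") returns None for any row that has no td cells, and calling .text on None raises an AttributeError. A slightly more defensive variant might look like the sketch below. The HTML fragment, the data-stat="player" attribute, the header-style row, and the helper names find_player_row and stat are all assumptions made for illustration, so check the real page's markup before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one header-style row without <td> cells, one player row.
html = """
<table id="per_game"><tbody>
<tr class="thead"><th>Rk</th><th>Player</th><th>PTS</th></tr>
<tr><th data-stat="ranker">3</th><td data-stat="player">Jimmy Butler</td>
    <td data-stat="ast_per_g">5.3</td><td data-stat="pts_per_g">22.9</td></tr>
</tbody></table>
"""

def find_player_row(table, name):
    """Return the first body row whose player cell matches name, else None."""
    for row in table.find("tbody").find_all("tr"):
        cell = row.find("td", {"data-stat": "player"})  # assumed attribute value
        if cell is not None and cell.get_text(strip=True) == name:
            return row
    return None

def stat(row, key):
    """Return the text of the cell with the given data-stat, or None if absent."""
    cell = row.find("td", {"data-stat": key})
    return cell.get_text(strip=True) if cell is not None else None

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "per_game"})
row = find_player_row(table, "Jimmy Butler")

# fg_pct is deliberately missing from the fragment to show the None fallback.
wanted = ("pts_per_g", "ast_per_g", "fg_pct")
stats = {key: stat(row, key) for key in wanted} if row is not None else {}
print(stats)
```

Returning None for a missing cell keeps one absent column from aborting the whole extraction, which is handy if the same code is later pointed at other teams' pages.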
    

    Lastly, tables are a great way to learn BeautifulSoup and HTML since they are well structured. However, once you get the hang of it, consider using pandas to pull table tags:

    import pandas as pd
    
    url = "https://www.basketball-reference.com/teams/MIA/2023.html"
    df = pd.read_html(url, attrs={'id':'per_game'})
    

    pd.read_html returns a list of DataFrames (one per matching table), so take the first element, df[0], and then just filter it.

    Output:

    print(df)
    [    Rk             Player  Age   G  GS    MP  ...  AST  STL  BLK  TOV   PF   PTS
    0    1        Tyler Herro   23  67  67  34.9  ...  4.2  0.8  0.2  2.4  1.5  20.1
    1    2        Bam Adebayo   25  75  75  34.6  ...  3.2  1.2  0.8  2.5  2.8  20.4
    2    3       Jimmy Butler   33  64  64  33.4  ...  5.3  1.8  0.3  1.6  1.3  22.9
    3    4         Kyle Lowry   36  55  44  31.2  ...  5.1  1.0  0.4  1.9  2.6  11.2
    4    5       Caleb Martin   27  71  49  29.3  ...  1.6  1.0  0.4  1.1  2.0   9.6
    5    6          Max Strus   26  80  33  28.4  ...  2.1  0.5  0.2  0.9  2.1  11.5
    6    7     Victor Oladipo   30  42   2  26.3  ...  3.5  1.4  0.3  2.1  2.4  10.7
    7    8       Gabe Vincent   26  68  34  25.9  ...  2.5  0.9  0.1  1.4  2.3   9.4
    8    9         Kevin Love   34  21  17  20.0  ...  1.9  0.4  0.2  1.1  1.5   7.7
    9   10  Haywood Highsmith   26  54  11  17.9  ...  0.8  0.7  0.3  0.8  1.5   4.4
    10  11    Duncan Robinson   28  42   1  16.5  ...  1.1  0.3  0.0  0.7  1.8   6.4
    11  12     Jamaree Bouyea   23   4   0  16.3  ...  1.0  1.0  0.5  1.0  1.3   3.8
    12  13        Cody Zeller   30  15   2  14.5  ...  0.7  0.2  0.3  0.9  2.2   6.5
    13  14   Orlando Robinson   22  31   1  13.7  ...  0.8  0.4  0.4  0.5  1.7   3.7
    14  15       Nikola Jović   19  15   8  13.6  ...  0.7  0.5  0.1  0.7  1.3   5.5
    15  16          Dru Smith   25   5   1  13.4  ...  1.0  0.8  0.6  0.2  2.0   2.2
    16  17         Jamal Cain   23  18   0  13.3  ...  0.7  0.6  0.1  0.3  1.2   5.4
    17  18     Dewayne Dedmon   33  30   0  11.7  ...  0.5  0.2  0.5  0.6  2.0   5.7
    18  19      Udonis Haslem   42   7   1  10.1  ...  0.0  0.1  0.3  0.1  1.6   3.9
    19  20     Omer Yurtseven   24   9   0   9.2  ...  0.2  0.2  0.2  0.4  1.8   4.4
    
    [20 rows x 28 columns]]
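Filtering the table once it is in a DataFrame could look roughly like this. To keep the sketch runnable without a network request, it rebuilds three rows by hand from the output printed above instead of calling pd.read_html. With the real result you would first take the first element of the returned list. The variable names butler and top_scorer are illustrative.

```python
import pandas as pd

# Three rows copied from the printed output above (only a few columns).
df = pd.DataFrame({
    "Player": ["Tyler Herro", "Bam Adebayo", "Jimmy Butler"],
    "AST": [4.2, 3.2, 5.3],
    "PTS": [20.1, 20.4, 22.9],
})

# Boolean mask: keep only the row(s) for one player.
butler = df[df["Player"] == "Jimmy Butler"]
print(butler[["Player", "PTS", "AST"]].to_string(index=False))

# One possible definition of "best player": highest points per game.
top_scorer = df.loc[df["PTS"].idxmax(), "Player"]
print(top_scorer)
```

Picking the top scorer with idxmax is just one way to define the "best" player for the flash cards. Any other column could be swapped in.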
    