
Currently, I am trying to scrape the upcoming NFL football game schedule for each week from the ESPN website using Python and store it in a DataFrame. I'm unable to find a way to produce the desired output, and I am also very new to coding and Python in general. Could someone show me a way to get from my current output to the desired output? The website I am scraping is: https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2

I want to output a DataFrame with the columns: away team, home team, game time, game location, and odds.

So far, using the following code, I was able to get the team names and put them into a DataFrame. See below.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.espn.com/nfl/scoreboard/_/week/1/year/2024/seasontype/2'
# Headers to make the request look like it's coming from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

# Send a GET request to the webpage with headers
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
# Find all the game containers
game_containers = soup.find_all('a', class_='AnchorLink')
team_names = soup.find_all('div', class_='ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db')
# List to hold the team names
team_list = [team.text for team in team_names]
# Pair the team names into away and home teams
away_teams = team_list[::2]   # Every other team starting from the first
home_teams = team_list[1::2]  # Every other team starting from the second

# Create a DataFrame from the data
df = pd.DataFrame({
    'Away Team': away_teams,
    'Home Team': home_teams
})

# Print the DataFrame
print(df)

I'll explain below what I did and what I see from the HTML inspector.

This is where I am stuck; my shallow knowledge limits me, and I'm not sure how to write code to extract that information from this HTML. Any help or advice is appreciated. Thanks!

  • WHAT I SEE
    Getting the time, location, and odds is tricky, and I need some help, as I have no idea what to do when looking at the HTML code on ESPN. From what I can tell, the body of the webpage that contains the whole schedule is:
    <div class="mt3">
    Each game box section is then displayed by:
    <div class="ScheduleTables mb5 ScheduleTables--nfl ScheduleTables--football

When I dive deeper, the lines
<tbody class="Table__TBODY"><tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
contain all the information I need.

Embedded under the Table__TR class is the following:
<td class="colspan__col Table__TD"> <td class="date__col Table__TD"><a class="AnchorLink" tabindex="0" href="/nfl/game/_/gameId/401671789/ravens-chiefs">8:20 PM</a></td> <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td> <td class="odds__col Table__TD"><div class="Odds__Message"><a class="AnchorLink" tabindex="0" data-track-event_name="espn bet interaction" data-track-event_detail="espnbet:espn:nfl:schedule:pointSpread:KC -3" data-track-basemetrics="sport,league"

2 Answers


  1. A better way is to get the data from ESPN's APIs, as it's more robust (i.e. it's not reliant on the HTML structure: if ESPN changes their web design, your code breaks, but with the API the data will usually always come in the same JSON form), and you get far more data if you want it:

    import requests
    import pandas as pd
    
    # Function to check if an element is a list or dictionary
    def is_list_or_dict(x):
        return isinstance(x, (list, dict))
    
    def merge_data(data):
        game_df = pd.json_normalize(data)
        game_df = game_df.drop(['uid'], axis=1)
    
        team_df = pd.json_normalize(data,
                                record_path=['competitions', 'competitors'],
                                meta=['id'],
                                meta_prefix='game.')
        team_df = team_df.drop(['id', 'uid'], axis=1)
        
        odds = pd.json_normalize(data,
                                record_path=['competitions', 'odds'],
                                meta=['id'],
                                meta_prefix='game.')
        
        
        # Drop columns in which every value is a nested list/dict
        columns_to_keep = ~game_df.applymap(is_list_or_dict).all(axis=0)
    
        # Filter the DataFrame to keep only the desired columns
        game_df = game_df.loc[:, columns_to_keep]
        
        df = pd.merge(game_df, team_df, how='outer', left_on=['id'], right_on=['game.id']).drop(['id'], axis=1)
        df = pd.merge(df, odds, how='outer', left_on=['game.id'], right_on=['game.id'])
        
        return df
    
        
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)  Chrome/58.0.3029.110 Safari/537.3"
        }
    
    
    url = 'https://cdn.espn.com/core/nfl/schedule?xhr=1&year=2024'
    jsonData = requests.get(url, headers=headers).json()
    calendar = jsonData['content']['calendar']
    
    dfs = []
    for each in calendar:
        seasontype = each['value']
        seasontypeLabel = each['label']
        weeks = each['entries']
        for eachWeek in weeks:
            weekNo = eachWeek['value']
            weekLabel = eachWeek['label']
        
            url = f'https://cdn.espn.com/core/nfl/schedule?xhr=1&year=2024&seasontype={seasontype}&week={weekNo}'
            jsonData = requests.get(url, headers=headers).json()
            schedules = jsonData['content']['schedule']
            
            print(f'Acquiring {seasontypeLabel}: {weekLabel}')
    
            for k,v in schedules.items():
                games = v['games']
                
                df = merge_data(games)
                df['seasontype'] = seasontype
                df['seasontypeLabel'] = seasontypeLabel
                df['week'] = weekNo
                df['weekLabel'] = weekLabel
                
                dfs.append(df)
    
    
    
    results = pd.concat(dfs)
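
    The key to merge_data above is pd.json_normalize's record_path / meta arguments: record_path flattens a nested list into one row per element, and meta carries parent-level fields down onto each of those rows. A minimal offline sketch of the same pattern (the sample dict only mimics the rough shape of ESPN's JSON; field names like team.shortDisplayName are illustrative assumptions, not copied from the real response):

```python
import pandas as pd

# Tiny sample mimicking the assumed shape of one entry in the "games" list
games = [
    {
        "id": "401671789",
        "name": "Ravens at Chiefs",
        "competitions": [
            {
                "competitors": [
                    {"homeAway": "home", "team": {"shortDisplayName": "Chiefs"}},
                    {"homeAway": "away", "team": {"shortDisplayName": "Ravens"}},
                ],
                "odds": [{"details": "KC -3"}],
            }
        ],
    }
]

# record_path walks into the nested competitors list; meta pulls the game
# id up alongside each flattened row, prefixed to avoid column collisions.
teams = pd.json_normalize(
    games,
    record_path=["competitions", "competitors"],
    meta=["id"],
    meta_prefix="game.",
)
print(teams[["game.id", "homeAway", "team.shortDisplayName"]])
```

    The same call with record_path=['competitions', 'odds'] is what produces the odds frame, and merging both back on game.id gives one row per team per game.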
    
  2. Three ways to do it:

    1 – Let pandas parse it:

    import requests
    from io import StringIO
    import pandas as pd
    
    url = 'https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2'
    # Headers to make the request look like it's coming from a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
        }
    
    response = requests.get(url, headers=headers).text
    # Wrap in StringIO: newer pandas deprecates passing literal HTML strings
    dfs = pd.read_html(StringIO(response))
    df = pd.concat(dfs)
    

    2 – Your method with bs4 – which is far more complicated, so I'm not going to code it out in full. But what you should do is iterate over each <tr> tag and store each column value from its <td> tags.
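
    As a rough sketch of that approach, here it is run against a small inline snippet built from the class names quoted in the question (date__col, location__col, odds__col). Treat those selectors as assumptions – the live ESPN markup may differ, and you would fetch the real page with requests instead of using this inline string:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline snippet modeled on the markup described in the question;
# the real ESPN page may use different or additional classes.
html = """
<table><tbody class="Table__TBODY">
  <tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
    <td class="date__col Table__TD"><a class="AnchorLink">8:20 PM</a></td>
    <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td>
    <td class="odds__col Table__TD"><div class="Odds__Message"><a class="AnchorLink">KC -3</a></div></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("tr.Table__TR"):
    # select_one returns None when a column is missing, so guard each lookup
    time_cell = tr.select_one("td.date__col a")
    loc_cell = tr.select_one("td.location__col")
    odds_cell = tr.select_one("td.odds__col")
    rows.append({
        "Game Time": time_cell.get_text(strip=True) if time_cell else None,
        "Location": loc_cell.get_text(strip=True) if loc_cell else None,
        "Odds": odds_cell.get_text(strip=True) if odds_cell else None,
    })

df = pd.DataFrame(rows)
print(df)
```

    You would combine this with your existing away/home team extraction to fill out the full DataFrame.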

    3 – Use an API – see my other answer.
