I am trying to scrape the ESPN website with Python to get the upcoming NFL game schedule for each week and store it in a DataFrame, but I can't find a way to produce the desired output. I am also very new to coding, Python, and programming in general. Could someone help me get from my current output to the desired output? The page I am scraping is: https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2
I want to output a DataFrame with the columns: away team, home team, game time, game location, and odds.
So far, using the following code, I have been able to get the team names and put them into a DataFrame. See below.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.espn.com/nfl/scoreboard/_/week/1/year/2024/seasontype/2'
# Headers to make the request look like it's coming from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
# Send a GET request to the webpage with headers
response = requests.get(url, headers=headers)
src = (response.content)
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the game containers
game_containers = soup.find_all('a',class_="AnchorLink" )
team_names = soup.find_all('div', class_='ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db')
# List to hold the team names
team_list = [team.text for team in team_names]
# Pair the team names into away and home teams
away_teams = team_list[::2] # Every other team starting from the first
home_teams = team_list[1::2] # Every other team starting from the second
# Create a DataFrame from the data
df = pd.DataFrame({
'Away Team': away_teams,
'Home Team': home_teams
})
# Print the DataFrame
print(df)
I’ll explain below what I did and what I see when inspecting the HTML.
This is where I am stuck, and my limited knowledge holds me back. I am not sure how to write code to extract that information from this HTML. Any help or advice is appreciated. Thanks!
- WHAT I SEE
Getting the time, location, and odds is the tricky part, and I need some help, as I am lost when looking at the HTML on ESPN. From what I can tell, the part of the page body that contains the whole schedule is:
<div class="mt3">
Each game box section is then wrapped in:
<div><div class="ScheduleTables mb5 ScheduleTables--nfl ScheduleTables--football
When I dive deeper, the lines:
<tbody class="Table__TBODY"><tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
contain all the information I need.
Nested under the Table__TR row is the following:
<td class="colspan__col Table__TD"> <td class="date__col Table__TD"><a class="AnchorLink" tabindex="0" href="/nfl/game/_/gameId/401671789/ravens-chiefs">8:20 PM</a></td> <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td> <td class="odds__col Table__TD"><div class="Odds__Message"><a class="AnchorLink" tabindex="0" data-track-event_name="espn bet interaction" data-track-event_detail="espnbet:espn:nfl:schedule:pointSpread:KC -3" data-track-basemetrics="sport,league"
2 Answers
A better way is to get the data from an API, as it is more robust (i.e., it does not rely on the HTML structure: if ESPN changes their web design, your code breaks, whereas an API will usually keep returning the data in the same JSON form), and you get far more data if you want it.
3 ways to do it:
1 – Let pandas parse it:
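A minimal sketch of this option, assuming the schedule page is served as plain HTML tables (as the markup in the question suggests) and that lxml, or bs4 plus html5lib, is installed for `pd.read_html`. The helper name `schedule_from_html` is made up for illustration.

```python
from io import StringIO

import pandas as pd


def schedule_from_html(html: str) -> pd.DataFrame:
    """Parse every <table> in the page and stack the results into one DataFrame.

    Hypothetical helper: the name and the "stack everything" choice are
    assumptions for this sketch, not something ESPN or pandas prescribes.
    """
    # read_html returns a list with one DataFrame per <table> it finds.
    tables = pd.read_html(StringIO(html))
    # The page has one table per game day, so combine them into a single frame.
    return pd.concat(tables, ignore_index=True)
```

With the schedule URL and the `headers` from the question, this could be called as `schedule_from_html(requests.get(schedule_url, headers=headers).text)`, where `schedule_url` is the `/nfl/schedule/...` link. The column names are whatever headers ESPN puts on the tables, so check `df.columns` before renaming anything.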
2 – Your method with bs4, which is far more complicated, so I am not going to code it out. What you should do is iterate over each <tr> tag and store each column value from its <td> tags.
3 – Use an API: see my other solution.
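For anyone who does want to try the bs4 route from option 2, here is a rough sketch of the row-by-row idea. It relies only on the class names quoted in the question (`Table__TR`, `date__col`, `location__col`, `odds__col`) and assumes they are still what ESPN serves. The function name `parse_schedule_rows` and the output keys are made up for illustration.

```python
from bs4 import BeautifulSoup


def cell_text(row, css_class):
    """Return the visible text of the first <td> with this class, or None."""
    cell = row.select_one(f"td.{css_class}")
    return cell.get_text(" ", strip=True) if cell else None


def parse_schedule_rows(html: str) -> list:
    """Collect time, location, and odds from each schedule row.

    Hypothetical helper: the class names come from the question's HTML
    snippet and may change whenever ESPN redesigns the page.
    """
    soup = BeautifulSoup(html, "html.parser")
    games = []
    for row in soup.select("tr.Table__TR"):
        game_time = cell_text(row, "date__col")
        if game_time is None:
            # Header rows (and rows without a kickoff time) have no date cell.
            continue
        games.append({
            "Game Time": game_time,
            "Game Location": cell_text(row, "location__col"),
            "Odds": cell_text(row, "odds__col"),
        })
    return games
```

The returned list of dicts can be handed straight to `pd.DataFrame(...)`. Team names are left out because the question's snippet does not show how they are marked up on the schedule page.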