
I am new to coding and need some assistance. I am trying to build a web scraper for a project that involves scraping NFL roster data from 2000 to 2023, but I am getting an error when requesting the HTML. I am using JupyterLab (Python via Pyodide) to write my code, and this is the only code I have:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO

years = list(range(2000, 2024))
url = 'https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023'
data = requests.get(url)

This is the error I’m getting:

(JsException: NetworkError: Failed to execute ‘send’ on ‘XMLHttpRequest’: Failed to load ‘https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023’.)

Can you explain why I am getting this error and how I can fix it?

2 Answers


  1. You didn't specify the request headers. Note also that this page doesn't use table tags, so you can't use pd.read_html; the roster is laid out with divs instead.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    
    url = "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023"
    # Browser-like headers so the site doesn't reject the request as a bot
    headers = {
      'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
      'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # The roster is a div-based table, not a <table> element
    table = soup.find('div', class_='divtable divtable-striped divtable-mobile')
    # Column names come from the children of the header row
    table_head = [head.get_text() for head in table.find('div', class_='thead')]
    # Remove mobile-only label spans so cell text stays clean
    for s in table.find_all('span', class_='visible-xs-inline'):
        s.extract()
    result = []
    for row in table.find_all('div', class_='tr'):
        # Pair each column name with the matching cell text in this row
        result.append(dict(zip(table_head, [cell.get_text() for cell in row.find_all('div', class_='td')])))
    df = pd.DataFrame(result)
    print(df)
    

    OUTPUT:

         #            Player Pos   G  GS Age            College
    0   82   Andre Baccellia  WR   5   0  26         Washington
    1    3       Budda Baker  DB  12  12  27         Washington
    2   96        Eric Banks  DE   2   0  25  Texas-San Antonio
    3   51       Krys Barnes  LB  16   6  25               UCLA
    4   66    Jackson Barton  OT   1   0  28               Utah
    ..  ..               ...  ..  ..  ..  ..                ...
    73  21  Garrett Williams  DB   9   6  22           Syracuse
    74  27     Divaad Wilson  DB   2   1  23    Central Florida
    75  20      Marco Wilson  DB  15  11  24            Florida
    76  14    Michael Wilson  WR  13  12  23           Stanford
    77  10        Josh Woods  LB  11   7  27           Maryland
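
    The years list from the question is never actually used. Assuming footballdb.com keeps the same URL pattern for every season (only the final year segment changes, as in the 2023 example), the parsing above can be wrapped in a loop over all seasons. roster_url is a hypothetical helper name for this sketch, not part of any library:

    ```python
    def roster_url(team_slug: str, year: int) -> str:
        # Assumed URL pattern, generalized from the 2023 example in the question
        return f"https://www.footballdb.com/teams/nfl/{team_slug}/roster/{year}"

    if __name__ == "__main__":
        import requests  # network calls only happen when run as a script

        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
        }
        for year in range(2000, 2024):
            response = requests.get(roster_url("arizona-cardinals", year),
                                    headers=headers)
            print(year, response.status_code)
            # ...parse response.text with BeautifulSoup as shown above,
            # then collect each season's DataFrame into one list...
    ```

    Being polite with a short time.sleep between requests is a good idea when looping over 24 seasons.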
    
  2. You need to send headers with your get request, specifically User-Agent. When you send this value, the request appears to come from a browser, i.e. a real person rather than a bot/scraper. You can find this value easily by Googling "what is my user agent". Copy the entire string; you will need it in a minute.

    Declare a dict using the value you copied:

    my_headers = {
        "User-Agent": "<YOUR_VALUE>"
    }
    

    Pass headers as an argument in the get method:

    my_url = "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023"
    data = requests.get(url=my_url, headers=my_headers)
    print(data.content) # just to confirm you got the response back
    

    Here is the scenic route to get your User-Agent and see what values are/could be there in "headers", if you’re interested:

    1. Hit F12 on your keyboard when viewing this page. The developer tools will open up.
    2. Navigate to the "Network" tab
    3. Choose "All"
    4. If you don’t see anything, no worries; just refresh the page
    5. Click on an item; another section will pop up
    6. Click on "Headers" and scroll down until you find "User-Agent"