
I am trying to scrape a table from Baseball Reference: https://www.baseball-reference.com/players/b/bondsba01.shtml. The table I want is the one with id="batting_value", but when I try to print what I have scraped, the program returns an empty list instead. Any information or assistance is appreciated, thanks!

from bs4 import BeautifulSoup
from urllib.request import urlopen

root_page = "https://www.baseball-reference.com/players/b/bondsba01.shtml"
soup = BeautifulSoup(urlopen(root_page), features = 'lxml')

table = soup.find('table', id = 'batting_value')
print(table)

I’ve tried to print the <div> with id="div_batting_value", which contains the table, but that still doesn’t work. However, I can successfully print other <div> elements with different ids.

2 Answers


  1. The main issue here is that the table is hidden inside HTML comments, so you have to uncomment it before BeautifulSoup can find it. The simplest solution, in my opinion, is to strip the comment markers from the response text:

    .replace('<!--','').replace('-->','')
    

    An alternative is to be more specific and use bs4.Comment.

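    Example:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.baseball-reference.com/players/b/bondsba01.shtml'
    # Remove the comment markers so the commented-out tables become regular markup
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'lxml')
    table = soup.select_one('#batting_value')
    print(table)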
    

    Or combined with pandas.read_html():

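    from io import StringIO
    import pandas as pd
    import requests

    url = 'https://www.baseball-reference.com/players/b/bondsba01.shtml'
    # Strip the comment markers; StringIO avoids the literal-HTML deprecation warning in newer pandas
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    df = pd.read_html(StringIO(html), attrs={'id': 'batting_value'})[0]
    # Drop rows without a league value and the repeated header rows
    df[(~df.Lg.isna()) & (df.Lg != 'Lg')]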
    

    Results in:

    Year Age Tm Lg G PA Rbat Rbaser Rdp Rfield Rpos RAA WAA Rrep RAR WAR waaWL% 162WL% oWAR dWAR oRAR Salary Pos Awards
    0 1986 21 PIT NL 113 484 3 5 0 8 1 17 1.9 16 34 3.5 0.517 0.512 2.6 1 25 $60,000 *8/H RoY-6
    1 1987 22 PIT NL 150 611 11 3 1 24 -3 36 3.7 21 57 5.8 0.525 0.523 3.2 2.1 33 $100,000 *78H/9 nan
    20 2006 41 SFG NL 130 493 30 1 0 1 -4 27 2.5 15 42 4 0.52 0.516 3.9 -0.4 41 $19,331,470 *7H/D nan
    21 2007 42 SFG NL 126 477 37 -1 -1 -10 -4 21 2 15 36 3.4 0.516 0.513 4.4 -1.5 46 $15,533,970 *7H/D AS
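
The bs4.Comment alternative mentioned in this answer could look roughly like the sketch below. It is only an illustration: the helper name `find_commented_table` and the tiny `sample_html` string are made up for the example (they are not the site's real markup), and the sketch assumes the table sits inside a single comment node. It first looks for the table in the regular markup, then re-parses each comment that mentions the id.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(html, table_id):
    """Return the <table> with the given id, even if it is inside an HTML comment.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    soup = BeautifulSoup(html, "html.parser")

    # The table may already be part of the regular markup.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table

    # Otherwise, re-parse every comment whose text mentions the id.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id not in comment:
            continue
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# Made-up stand-in for the real page, so the example runs without a network call.
sample_html = """
<div id="all_batting_value">
<!--
<div id="div_batting_value">
<table id="batting_value"><tbody><tr><th>1986</th><td>PIT</td></tr></tbody></table>
</div>
-->
</div>
"""

table = find_commented_table(sample_html, "batting_value")
print(table.find("td").get_text())
```

Compared with replacing every `<!--` and `-->` in the page, this leaves unrelated comments alone and only parses the ones that mention the id you asked for. For the real page you would pass the downloaded HTML text in place of `sample_html`.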
  2. There is only one table on the page:

    print(len(soup.find_all('table')))
    

    output: 1

    You can use a simple find to get the table:

    table = soup.find('table')
    

    And work with it. For example, to get the header cells (th) inside the table body:

    table.find('tbody').find_all('th')
    

    Does this solve your task?
