
I’m web scraping the heights of a list of athletes. I have written the code to get the heights, but after inspecting the element I noticed that the visible text gives the height in feet and inches, while the "data-sort" attribute gives the same height in inches. Both are in the td tag with class "height". However, when I use get_text() or .text to strip the HTML, it only prints the height in feet and drops the hidden height in inches. Is there a way to get the height in inches? That would make it easier to do the math.

Here is an example of what I’m scraping. I want to drop everything else and get only the heights in inches, which would be [79, 85, 74, … in this case.

<td class="height" data-sort="79">6-7</td>
<td class="height" data-sort="85">7-1</td>
<td class="height" data-sort="74">6-2</td>

This is my code:

from bs4 import BeautifulSoup
import requests 

urls=['https://goduke.com/sports/mens-basketball/roster']

ListData=[]
for url in urls:
    page=requests.get(url).text
    pagesoup=BeautifulSoup(page,'html.parser')
    h=pagesoup.find_all('td', class_="height")
    ListData.append(h)
for cells in ListData:
    for x in cells:
        print(x.text)  # prints the visible text (e.g. 6-7), not the data-sort value

2 Answers


  1. If you use a CSS selector, you can simply pass the first class name.

    from scrapy.selector import Selector

  2. from bs4 import BeautifulSoup
    import requests

    urls = ['https://goduke.com/sports/mens-basketball/roster']

    ListData = []

    for url in urls:
        page = requests.get(url).text
        pagesoup = BeautifulSoup(page, 'html.parser')
        # Match only <td class="height"> cells that carry a data-sort attribute
        tds = pagesoup.select('td.height[data-sort]')
        for td in tds:
            # data-sort holds the height in inches, as a string
            ListData.append(td.attrs['data-sort'])
    print(ListData)
    

    Output:

    ['79', '85', '74', '74', '77', '77', '78', '77', '82', '85', '80', '84', '77', '84', '68']
    
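Since the goal in the question is to do math on the heights, note that attribute values come back from BeautifulSoup as strings, so they likely need converting to int first. Here is a minimal sketch of that step. It runs against the three sample rows from the question rather than the live roster page, and the `<table><tr>` wrapper and the name `heights_in` are my own assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample cells copied from the question; the <table><tr> wrapper is assumed.
# A real run would fetch the page with requests instead.
html = """
<table><tr>
<td class="height" data-sort="79">6-7</td>
<td class="height" data-sort="85">7-1</td>
<td class="height" data-sort="74">6-2</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A tag's attributes are read like dictionary keys; the values are strings,
# so convert each one to int before doing arithmetic.
heights_in = [int(td["data-sort"]) for td in soup.select("td.height[data-sort]")]

print(heights_in)  # [79, 85, 74]
print(sum(heights_in) / len(heights_in))  # average height in inches
```

If some cells might lack the attribute, `td.get("data-sort")` returns None instead of raising a KeyError, which may be easier to filter on.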