I’m trying to build a simple web scraping tool.
Right now I’m having an issue extracting data from each row because the opening <tr> tag is missing.
(Only the opening <tr> is missing; the closing </tr> is still there.)
Below is my code:
from bs4 import BeautifulSoup
import requests
url = "https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/"
data = requests.get(url).text
print(data)
The output is missing the opening <tr> tag; only the closing </tr> exists for each row:
<tbody>
((THERES SUPPOSED TO BE A <tr> TAG HERE))))!!!
<td class="fav"><img alt="favorite icon" src="/img/fav.svg?v2" data-id="2"></td>
</td><td class="rank-td td-right" data-sort="1">1
</td><td class="name-td">
<div class="logo-container"><img loading="lazy" class="company-logo" alt="Apple logo" src="/img/company-logos/64/AAPL.png" data-img-path="/img/company-logos/64/AAPL.png" data-img-dark-path="/img/company-logos/64/AAPL.D.png"></div>
<div class="name-div"><a href="/apple/marketcap/"><div class="company-name">Apple</div>
<div class="company-code"><span class="rank d-none"></span>AAPL</div>
</a></div></td><td class="td-right" data-sort="2891576508416">$2.891 T</td><td class="td-right" data-sort="18592">$185.92</td><td data-sort="18" class="rh-sm"><span class="percentage-green"><svg class="a" viewBox="0 0 12 12"><path d="M10 8H2l4-4 4 4z"></path></svg>0.18%</span></td><td class="p-0 sparkline-td red"><svg><path d="M0,21 5,18 10,22 15,14 20,16 25,12 30,8 35,14 40,11 45,3 50,3 55,4 60,8 65,6 70,10 75,11 80,13 85,13 90,14 95,14 100,13 105,16 110,16 115,31 120,34 125,39 130,41 135,31 140,32 145,30 150,31 155,30" /></svg></td><td>🇺🇸 <span class="responsive-hidden">USA</span></td>
</tr>
Thank you!
I also tried the following:
soup = BeautifulSoup(data, "lxml")
table = soup.find("table")
# print(table)
rows = table.find_all("tr")
but it doesn’t work because, again, the opening <tr> tag is missing.
2 Answers
The issue is that the HTML of the page is malformed. To parse it the way a browser does, use the html5lib parser.
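A minimal sketch of the html5lib approach, run against a trimmed-down stand-in for the page rather than the live site. The sample markup, the values in the second row, and the helper name parse_rows are assumptions made for illustration. html5lib has to be installed separately (pip install html5lib).

```python
from bs4 import BeautifulSoup

# Stand-in for the real page: every row has a closing </tr> but no opening <tr>.
# The second row is made-up sample data, not taken from the site.
SAMPLE_HTML = """
<table>
<thead><tr><th>Rank</th><th>Name</th><th>Price</th></tr></thead>
<tbody>
<td class="rank-td">1</td><td class="name-td"><div class="company-name">Apple</div></td><td class="td-right">$185.92</td>
</tr>
<td class="rank-td">2</td><td class="name-td"><div class="company-name">Example Corp</div></td><td class="td-right">$1.00</td>
</tr>
</tbody>
</table>
"""


def parse_rows(html):
    """Return the text of every cell, grouped by table row (hypothetical helper)."""
    # html5lib follows the HTML5 parsing rules, so a <td> that appears directly
    # inside <tbody> gets an implied <tr> wrapped around it, as in a browser.
    soup = BeautifulSoup(html, "html5lib")
    body = soup.find("table").find("tbody")
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in body.find_all("tr")
    ]


if __name__ == "__main__":
    # For the live page, SAMPLE_HTML would be replaced by requests.get(url).text
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

With the rows restored, the table.find_all("tr") call from the question should return one element per company.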
There is a header above tbody (table > thead). You don’t need the header if you only want to extract the data in the table.
Just refer to the "loc" values, which are CSS selectors and can be passed to BS4's select, in the following XML to extract the data in the table:
Sample of extracted data:
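As a rough illustration of the selector approach, the sketch below selects cells by class, so it does not depend on the missing <tr> tags. The class names come from the HTML in the question; the sample markup, the second row's values, and the helper name extract_companies are assumptions, and these selectors are not necessarily the "loc" values this answer refers to.

```python
from bs4 import BeautifulSoup

# Stand-in for the real page, with the opening <tr> tags missing.
# The second row is made-up sample data, not taken from the site.
SAMPLE_HTML = """
<table>
<tbody>
<td class="rank-td td-right" data-sort="1">1
</td><td class="name-td">
<div class="name-div"><a href="/apple/marketcap/"><div class="company-name">Apple</div>
<div class="company-code"><span class="rank d-none"></span>AAPL</div>
</a></div></td>
</tr>
<td class="rank-td td-right" data-sort="2">2
</td><td class="name-td">
<div class="name-div"><a href="/example/marketcap/"><div class="company-name">Example Corp</div>
<div class="company-code"><span class="rank d-none"></span>EXM</div>
</a></div></td>
</tr>
</tbody>
</table>
"""


def extract_companies(html):
    """Return (rank, name, code) tuples using CSS selectors (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Each selector targets a cell by its class, so no <tr> wrapper is needed.
    ranks = [el.get_text(strip=True) for el in soup.select("td.rank-td")]
    names = [el.get_text(strip=True) for el in soup.select("div.company-name")]
    codes = [el.get_text(strip=True) for el in soup.select("div.company-code")]
    # Assumes every row contains all three cells, so the lists line up.
    return list(zip(ranks, names, codes))


if __name__ == "__main__":
    for company in extract_companies(SAMPLE_HTML):
        print(company)
```

Zipping parallel lists is the simplest option here, but it silently misaligns if a row lacks one of the cells; parsing with html5lib and iterating per row avoids that.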