Html - Only <tr> tag is missing using requests library

SnowHana
January 13, 2024

I’m trying to build a simple web scraping tool.
Right now I’m having trouble extracting data from each row because the opening <tr> tag is missing.
(Only the opening <tr> is missing; the closing </tr> is still there.)

Below is my code

from bs4 import BeautifulSoup
import requests

url = "https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/"
data = requests.get(url).text
print(data)

In the output, each row is missing its opening <tr> tag; only the closing </tr> is present:

<tbody>
((THERES SUPPOSED TO BE A <tr> TAG HERE))))!!!
<td class="fav"><img alt="favorite icon" src="/img/fav.svg?v2" data-id="2"></td>
</td><td class="rank-td td-right" data-sort="1">1
</td><td class="name-td">
<div class="logo-container"><img loading="lazy" class="company-logo" alt="Apple logo" src="/img/company-logos/64/AAPL.png" data-img-path="/img/company-logos/64/AAPL.png" data-img-dark-path="/img/company-logos/64/AAPL.D.png"></div>
<div class="name-div"><a href="/apple/marketcap/"><div class="company-name">Apple</div>
<div class="company-code"><span class="rank d-none"></span>AAPL</div>
</a></div></td><td class="td-right" data-sort="2891576508416">$2.891 T</td><td class="td-right" data-sort="18592">$185.92</td><td data-sort="18" class="rh-sm"><span class="percentage-green"><svg class="a" viewBox="0 0 12 12"><path d="M10 8H2l4-4 4 4z"></path></svg>0.18%</span></td><td class="p-0 sparkline-td red"><svg><path d="M0,21 5,18 10,22 15,14 20,16 25,12 30,8 35,14 40,11 45,3 50,3 55,4 60,8 65,6 70,10 75,11 80,13 85,13 90,14 95,14 100,13 105,16 110,16 115,31 120,34 125,39 130,41 135,31 140,32 145,30 150,31 155,30" /></svg></td><td>🇺🇸 <span class="responsive-hidden">USA</span></td>
</tr>

Thank you!

Edit: I also tried the following:

soup = BeautifulSoup(data, "lxml")
table = soup.find("table")
# print(table)
rows = table.find_all("tr")

but it doesn’t work because, again, the opening <tr> tag is missing.

Answers

The issue is that the page’s HTML is malformed. To parse it the way a browser does, use the html5lib parser:

import requests
from bs4 import BeautifulSoup

url = 'https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/'

soup = BeautifulSoup(requests.get(url).content, 'html5lib')

for tr in soup.table.select('tr'):
    tds = [t for td in tr.select('td') if (t:=td.get_text(strip=True, separator=' '))]
    if len(tds) == 6:
        print(*tds, sep='\t')

Prints:

1       Apple AAPL      $2.891 T        $185.92 0.18%   🇺🇸 USA
2       Microsoft MSFT  $2.887 T        $388.47 1.00%   🇺🇸 USA
3       Visa V  $542.91 B       $264.17 0.05%   🇺🇸 USA
4       JPMorgan Chase JPM      $488.72 B       $169.05 0.73%   🇺🇸 USA
5       UnitedHealth UNH        $482.35 B       $521.51 3.37%   🇺🇸 USA
6       Walmart WMT     $434.31 B       $161.32 0.13%   🇺🇸 USA
7       Johnson & Johnson JNJ   $390.91 B       $162.39 0.77%   🇺🇸 USA
8       Procter & Gamble PG     $354.94 B       $150.60 0.06%   🇺🇸 USA

...
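
To see the parser difference in isolation, here is a minimal sketch that runs offline on a tiny hand-written snippet imitating the malformed markup from the question (cells followed by a closing </tr> with no opening <tr>). The names MALFORMED and count_rows are made up for this illustration. Python's built-in html.parser stands in for a parser that does not rebuild missing tags; the question used lxml, whose recovery may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the page: <td> cells and a closing </tr>,
# but no opening <tr>.
MALFORMED = """<table><tbody>
<td class="rank-td">1</td><td class="name-td">Apple</td>
</tr>
</tbody></table>"""


def count_rows(html, parser):
    """Return how many <tr> elements the given parser ends up with."""
    soup = BeautifulSoup(html, parser)
    return len(soup.find_all("tr"))


# html.parser keeps the markup as written, so no <tr> exists in the tree:
#   count_rows(MALFORMED, "html.parser")  -> 0
# html5lib follows the HTML5 parsing rules, which insert an implied <tr>
# when a <td> appears directly inside <tbody>:
#   count_rows(MALFORMED, "html5lib")     -> 1
```

html5lib is a separate package (pip install html5lib) and is slower than the other parsers, so it is mainly worth reaching for when the markup is broken like this.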

There is a header above tbody (table > thead), but you don’t need it to extract the data in the table.
Just use the "loc" attributes in the following XML. Each one is a CSS selector, so it can also be passed to BS4’s select to extract the data in the table:

  <actions>
    <action_goto url="https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/" />
    <action_loopineles>
      <element loc="table.dataTable > tbody > tr:not([class*=sort])" />
        <action_extract tabname="dat_00000000000012ab">
          <column_element colname="c01" nickname="rank">
            <element loc="td.rank-td" />
            <elecontent_text top="true" />
          </column_element>
          <column_element colname="c02" nickname="name">
            <element loc="td.name-td div.company-name" />
          </column_element>
          <column_element colname="c03" nickname="marketCap">
            <element loc="tr > td:nth-child(4)" />
          </column_element>
          <column_element colname="c04" nickname="price">
            <element loc="tr > td:nth-child(5)" />
          </column_element>
          <column_element colname="c05" nickname="today">
            <element loc="tr > td:nth-child(6)" />
          </column_element>
          <column_element colname="c06" nickname="country">
            <element loc="tr > td:nth-child(8)" />
          </column_element>
        </action_extract>
    </action_loopineles>
  </actions>
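
The answer does not say which tool consumes this XML, so as a rough sketch only, the same "loc" selectors can be tried directly with BeautifulSoup. SAMPLE_HTML, COLUMNS and extract_rows are hypothetical names, and the sample is a hand-written, well-formed stand-in for one row of the page, not its real markup. The per-column selectors drop the leading "tr >" because they are applied relative to each row.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for one row of the real table.
SAMPLE_HTML = """
<table class="dataTable">
  <thead><tr class="sorting"><th>Rank</th></tr></thead>
  <tbody>
    <tr>
      <td class="fav"></td>
      <td class="rank-td">1</td>
      <td class="name-td"><div class="company-name">Apple</div></td>
      <td>$2.891 T</td>
      <td>$185.92</td>
      <td>0.18%</td>
      <td class="sparkline-td"></td>
      <td><span>USA</span></td>
    </tr>
  </tbody>
</table>
"""

# Column nickname -> CSS selector, taken from the "loc" attributes above.
COLUMNS = {
    "rank": "td.rank-td",
    "name": "td.name-td div.company-name",
    "marketCap": "td:nth-child(4)",
    "price": "td:nth-child(5)",
    "today": "td:nth-child(6)",
    "country": "td:nth-child(8)",
}


def extract_rows(html, parser="html.parser"):
    """Return one dict per data row, keyed by column nickname."""
    soup = BeautifulSoup(html, parser)
    rows = []
    # Same row selector as the action_loopineles element in the XML.
    for tr in soup.select("table.dataTable > tbody > tr:not([class*=sort])"):
        row = {}
        for name, css in COLUMNS.items():
            cell = tr.select_one(css)
            row[name] = cell.get_text(" ", strip=True) if cell else None
        rows.append(row)
    return rows
```

On the live page the opening <tr> tags are missing, so the row selector only finds rows if the parser rebuilds them; passing parser="html5lib" (as in the other answer) is the likely choice there.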

