
I’m trying to build a simple web scraping tool.
Right now I’m having trouble extracting data from each row because the opening <tr> tag is missing.
(Only the opening <tr> tag is missing; the closing </tr> tag is still there.)

Below is my code:

from bs4 import BeautifulSoup
import requests

url = "https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/"
data = requests.get(url).text
print(data)

In the output, the row has a closing </tr> tag but no opening <tr> tag:

<tbody>
(THERE'S SUPPOSED TO BE A <tr> TAG HERE!)
<td class="fav"><img alt="favorite icon" src="/img/fav.svg?v2" data-id="2"></td>
</td><td class="rank-td td-right" data-sort="1">1
</td><td class="name-td">
<div class="logo-container"><img loading="lazy" class="company-logo" alt="Apple logo" src="/img/company-logos/64/AAPL.png" data-img-path="/img/company-logos/64/AAPL.png" data-img-dark-path="/img/company-logos/64/AAPL.D.png"></div>
<div class="name-div"><a href="/apple/marketcap/"><div class="company-name">Apple</div>
<div class="company-code"><span class="rank d-none"></span>AAPL</div>
</a></div></td><td class="td-right" data-sort="2891576508416">$2.891 T</td><td class="td-right" data-sort="18592">$185.92</td><td data-sort="18" class="rh-sm"><span class="percentage-green"><svg class="a" viewBox="0 0 12 12"><path d="M10 8H2l4-4 4 4z"></path></svg>0.18%</span></td><td class="p-0 sparkline-td red"><svg><path d="M0,21 5,18 10,22 15,14 20,16 25,12 30,8 35,14 40,11 45,3 50,3 55,4 60,8 65,6 70,10 75,11 80,13 85,13 90,14 95,14 100,13 105,16 110,16 115,31 120,34 125,39 130,41 135,31 140,32 145,30 150,31 155,30" /></svg></td><td>🇺🇸 <span class="responsive-hidden">USA</span></td>
</tr>

Thank you!

I also tried the following:

soup = BeautifulSoup(data, "lxml")
table = soup.find("table")
# print(table)
rows = table.find_all("tr")

but it doesn’t work because, again, the opening <tr> tag is missing.

2 Answers


  1. The issue is that the page's HTML is malformed. To parse it the way a browser does, use the html5lib parser:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/'
    
    soup = BeautifulSoup(requests.get(url).content, 'html5lib')
    
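    for tr in soup.table.select('tr'):
        # Keep only non-empty cell texts (the icon and sparkline cells have no text).
        tds = [t for td in tr.select('td') if (t:=td.get_text(strip=True, separator=' '))]
        # Data rows have exactly six non-empty cells.
        if len(tds) == 6:
            print(*tds, sep='\t')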
    

    Prints:

    1       Apple AAPL      $2.891 T        $185.92 0.18%   🇺🇸 USA
    2       Microsoft MSFT  $2.887 T        $388.47 1.00%   🇺🇸 USA
    3       Visa V  $542.91 B       $264.17 0.05%   🇺🇸 USA
    4       JPMorgan Chase JPM      $488.72 B       $169.05 0.73%   🇺🇸 USA
    5       UnitedHealth UNH        $482.35 B       $521.51 3.37%   🇺🇸 USA
    6       Walmart WMT     $434.31 B       $161.32 0.13%   🇺🇸 USA
    7       Johnson & Johnson JNJ   $390.91 B       $162.39 0.77%   🇺🇸 USA
    8       Procter & Gamble PG     $354.94 B       $150.60 0.06%   🇺🇸 USA
    
    ...
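
To see why the parser choice matters without hitting the network, here is a minimal sketch that feeds the same broken markup to two parsers. The `SAMPLE` string and the `extract_rows` helper are made up for illustration and are not taken from the real page. The expectation is that Python's built-in `html.parser` ignores the stray `</tr>` and leaves the first row's cells outside any `<tr>`, while `html5lib` inserts the missing `<tr>` the way a browser would.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, trimmed-down sample: the first row has lost its opening <tr>.
SAMPLE = """
<table>
<tbody>
<td>1</td><td>Apple</td>
</tr>
<tr><td>2</td><td>Microsoft</td></tr>
</tbody>
</table>
"""


def extract_rows(html, parser):
    """Return the cell texts of every <tr> that the given parser produces."""
    soup = BeautifulSoup(html, parser)
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]


if __name__ == "__main__":
    for parser in ("html.parser", "html5lib"):
        try:
            print(parser, extract_rows(SAMPLE, parser))
        except FeatureNotFound:
            # html5lib is a separate package (pip install html5lib).
            print(parser, "is not installed")
```

If `html5lib` is not installed, BeautifulSoup raises `FeatureNotFound`, which the loop reports instead of crashing.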
    
  2. There is a header row above the tbody (table > thead). You don’t need the header to extract the data in the table.
    Just use the "loc" values in the following XML to extract the data; each one is a CSS selector that can be passed to BS4's select:

      <actions>
        <action_goto url="https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/" />
        <action_loopineles>
          <element loc="table.dataTable > tbody > tr:not([class*=sort])" />
            <action_extract tabname="dat_00000000000012ab">
              <column_element colname="c01" nickname="rank">
                <element loc="td.rank-td" />
                <elecontent_text top="true" />
              </column_element>
              <column_element colname="c02" nickname="name">
                <element loc="td.name-td div.company-name" />
              </column_element>
              <column_element colname="c03" nickname="marketCap">
                <element loc="tr > td:nth-child(4)" />
              </column_element>
              <column_element colname="c04" nickname="price">
                <element loc="tr > td:nth-child(5)" />
              </column_element>
              <column_element colname="c05" nickname="today">
                <element loc="tr > td:nth-child(6)" />
              </column_element>
              <column_element colname="c06" nickname="country">
                <element loc="tr > td:nth-child(8)" />
              </column_element>
            </action_extract>
        </action_loopineles>
      </actions>
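
As a rough sketch of how those "loc" selectors might be used from BS4, the snippet below applies them to a small, well-formed sample. The `SAMPLE` markup, the `sort-helper` row, the `COLUMNS` mapping, and the `extract_table` helper are all hypothetical and only mimic the page's column order. The per-column selectors are written relative to each row, so `tr > td:nth-child(4)` becomes `td:nth-child(4)`. On the real page a lenient parser such as html5lib may still be needed because of the missing `<tr>`.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample that mimics the page's column order:
# fav, rank, name, market cap, price, today, sparkline, country.
SAMPLE = """
<table class="dataTable">
<thead><tr><th>Rank</th><th>Name</th></tr></thead>
<tbody>
<tr class="sort-helper"><td>ignored</td></tr>
<tr>
<td class="fav"></td>
<td class="rank-td">1</td>
<td class="name-td"><div class="company-name">Apple</div><div class="company-code">AAPL</div></td>
<td>$2.891 T</td>
<td>$185.92</td>
<td>0.18%</td>
<td class="sparkline-td"></td>
<td>USA</td>
</tr>
</tbody>
</table>
"""

# Column nickname -> CSS selector, evaluated relative to each row.
COLUMNS = {
    "rank": "td.rank-td",
    "name": "td.name-td div.company-name",
    "marketCap": "td:nth-child(4)",
    "price": "td:nth-child(5)",
    "today": "td:nth-child(6)",
    "country": "td:nth-child(8)",
}


def extract_table(html, parser="html.parser"):
    """Return one dict per data row, keyed by the column nicknames."""
    soup = BeautifulSoup(html, parser)
    # Same row selector as in the XML: skip rows whose class contains "sort".
    rows = soup.select("table.dataTable > tbody > tr:not([class*=sort])")
    records = []
    for row in rows:
        record = {}
        for name, selector in COLUMNS.items():
            cell = row.select_one(selector)
            record[name] = cell.get_text(strip=True) if cell else None
        records.append(record)
    return records


if __name__ == "__main__":
    for record in extract_table(SAMPLE):
        print(record)
```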
    

    Sample of extracted data: (image not available)
