
I am trying to fetch, as rows, the values associated with each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales

There should be one row, with one value per header field, for each distinct href element in the HTML file.

This is one of the portions of the HTML that I am trying to scrape:


<tbody>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">191.04</span></td><td>191.32</td><td>194.51</td><td>193.92</td><td>191.01</td><td>380,544</td><td>73,122,008.42</td><td>2,793</td><td>-3.19</td><td>-1.64</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">22.5</span></td><td>0</td><td>22.5</td><td>0</td><td>0</td><td>3</td><td>67.20</td><td>1</td><td>0</td><td>0</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/6096">ACTINVR</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td><td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td><td>0</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/339083">AGUA</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-1">29</span></td><td>28.98</td><td>28.09</td><td>29</td><td>28</td><td>296,871</td><td>8,491,144.74</td><td>2,104</td><td>0.89</td><td>3.17</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td><td>03:20</td><td><span class="color-2">13.48</span></td><td>13.46</td><td>13.53</td><td>13.62</td><td>13.32</td><td>2,706,398</td><td>36,494,913.42</td><td>7,206</td><td>-0.07</td><td>-0.52</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span></td><td>03:20</td><td><span class="color-2">10.65</span></td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td><td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td><td>-3.10</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td><td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1862">AMX</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="color-2">14.56</span></td><td>14.58</td><td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td><td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td><td>-0.75</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a></td><td><span class="series">10</span></td><td>03:20</td><td><span class="color-2">21.09</span></td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td><td>51,005</td><td>1,076,281.67</td><td>22</td><td>-0.34</td><td>-1.59</td>
  </tr>
</tbody>

And my current code results in an empty DataFrame:

import pandas as pd
import requests
from bs4 import BeautifulSoup


# get the page from the webhost
page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
soup = BeautifulSoup(page.text, 'lxml')
#print(soup.p.text)
# yet it doesn't bring the expected rows!

print('Read html!')

# get the headers from the table head
thead = soup.find("thead")
tr = thead.find_all("tr")

headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]
#print(headers)

# create an empty pandas DataFrame with those columns
df = pd.DataFrame(columns=headers)

# fetch the data rows: every <tr> with role="row"
rows = soup.find_all('tr', {"role": "row"})

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        #print("The value in this cell is %s" % value)

        # append row to dataframe

I would like to know if it is possible to get a pandas DataFrame whose fields are
the ones in the headers list and whose rows correspond to each href element.

For better perspective, the expected output should match the table at the bottom of the provided website, whose first row has the following schema:

EMISORA  SERIE  HORA   ÚLTIMO  PPP     ANTERIOR  MÁXIMO  MÍNIMO  VOLUMEN  IMPORTE        OPS.   VAR PUNTOS  VAR %
AC       *      03:20  191.04  191.32  194.51    193.92  191.01  380,544  73,122,008.42  2,793  -3.19       -1.64

Is it possible to create such a dataset?

2 Answers


  1. You can use BeautifulSoup to parse the HTML and extract the necessary information from the href attributes. Then, construct a pandas DataFrame using this information.

    Try this:

    # Collected rows will go here; headers comes from your own header-parsing code
    data = []

    # Find all rows
    rows = soup.find_all('tr')

    # Iterate over each row
    for row in rows:
        # Find the anchor tag within the row
        anchor = row.find('a')
        if anchor:
            # Extract href and text content
            href = anchor['href']
            text = anchor.text.strip()

            # Find all cells in the row; skip the first one, which holds the anchor
            cells = row.find_all('td')
            cell_values = [cell.text.strip() for cell in cells[1:]]

            # Combine all values into a single row
            row_data = [text, *cell_values]

            # Append the row data to the main data list
            data.append(row_data)

    # Create a DataFrame
    df = pd.DataFrame(data, columns=headers)
    
  2. The website you provided uses JavaScript to load content dynamically, meaning BeautifulSoup will receive HTML in its response but will be missing significant pieces. For this reason, you will have to integrate your BeautifulSoup code with a more sophisticated tool like Selenium.
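
    A quick way to confirm this is to inspect what requests actually receives. A minimal sketch, assuming the same tableMercados id used in the code below (the static response may carry an empty table shell, or no table at all):

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
    soup = BeautifulSoup(page.text, 'html.parser')

    # the data rows are filled in by JavaScript, so the static HTML
    # either lacks the table or ships it with an empty tbody
    table = soup.find('table', id='tableMercados')
    print(table.find('tbody') if table else 'table not in static HTML')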

    Most of your code is correct, but it works with only a subset of the data shown on the page because of the dynamic rendering. I have adjusted your code to include Selenium; it navigates to the website, waits for the JavaScript to load, scrapes the newly rendered data, and turns the table into a pandas DataFrame:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time
    import pandas as pd
    
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)
    
    url = 'https://www.bmv.com.mx/es/mercados/capitales'
    
    driver.get(url)
    
    time.sleep(4) # wait for page to load
    
    html_content = driver.page_source
    driver.quit()
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    table = soup.find('table', id='tableMercados')
    
    # get column names
    columns = [th.get_text().strip() for th in table.find_all('th')]
    data = []
    rows = table.find('tbody').find_all('tr')
    
    for row in rows:
        row_data = [col.get_text().strip() for col in row.find_all('td')]
        data.append(row_data)
    
    df = pd.DataFrame(data, columns=columns)
    
    print(df)
    

    Printing the DataFrame shows the following data:

       EMISORA SERIE   HORA ÚLTIMO     PPP ANTERIOR  MÁXIMO  MÍNIMO     VOLUMEN           IMPORTE    OPS. VAR PUNTOS  VAR %
    0       AC     *  03:20  189.3  189.32   191.91  192.98  189.01     502,297     95,483,093.50   3,333      -2.59  -1.35
    1  ACCELSA     B  03:20   22.5       0     22.5       0       0           1             22.40       1          0      0
    2  ACTINVR     B  03:20  15.03       0       15   15.06   15.03      55,479        833,864.22      22       0.03   0.20
    3     AGUA     *  03:20  31.55   31.59     30.2   31.98    29.8     327,380     10,149,130.38   2,283       1.39   4.60
    4  ALEATIC     *  03:20   36.5       0     36.5       0       0           7            254.50       2          0      0
    5     ALFA     A  03:20  13.19   13.19    13.04    13.4   13.01   4,045,917     53,405,549.08  11,798       0.15   1.15
    6    ALPEK     A  03:20  10.84   10.82    10.73   10.93   10.67     369,472      3,999,136.42   4,451       0.09   0.84
    7    ALSEA     *  03:20  66.75   66.55    64.98   66.98   65.12   5,939,525    390,969,784.88   3,995       1.57   2.42
    8  ALTERNA     B  03:20    1.5       0      1.5       0       0          50             77.50       2          0      0
    9      AMX     B  03:20  15.27   15.29    15.01   15.38   15.01  96,328,174  1,472,868,239.86  48,894       0.28   1.87
    

    Which matches the data on the website.

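    As a side note, the fixed time.sleep(4) can be flaky on slow connections. An explicit wait on the table body is usually more reliable; a sketch using Selenium's WebDriverWait, with the same tableMercados id as above:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # wait up to 10 seconds for at least one data row to be rendered
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#tableMercados tbody tr'))
    )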

    EDIT:
    If you wish to scrape all of the data that the website provides, you will need to click the "next" button using Selenium; BeautifulSoup unfortunately does not provide the ability to click buttons. Here is an example doing just that.

    (NOTE: sometimes you can work around this by scraping different URLs, if clicking "next" loads a different URL; but clicking "next" on this page renders new data with JavaScript instead of changing the URL, so Selenium is doubly needed.)

    Here is the Selenium code to scrape all data provided by the site:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup
    import time
    import pandas as pd
    
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)
    
    url = 'https://www.bmv.com.mx/es/mercados/capitales'
    driver.get(url)
    
    wait = WebDriverWait(driver, 10)  # 10 second timeout
    wait.until(EC.presence_of_element_located((By.ID, "tableMercados_next")))
    
    all_data = []
    
    while True:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # re-locate the table on every iteration, since each page produces
        # a fresh soup; grab the column names only once, from the first page
        table = soup.find('table', id='tableMercados')

        if not all_data:
            columns = [th.get_text().strip() for th in table.find_all('th')]

        rows = table.find('tbody').find_all('tr')
        for row in rows:
            row_data = [col.get_text().strip() for col in row.find_all('td')]
            all_data.append(row_data)
    
        # check if there is a next page; the button is disabled if there isn't
        next_button = driver.find_element(By.ID, "tableMercados_next")
        if "disabled" in next_button.get_attribute("class"):
            print("Reached the last page.")
            break
        # load next page
        next_button.click()
        time.sleep(2)
    
    # create df from data
    df = pd.DataFrame(all_data, columns=columns)
    driver.quit()
    print(df)
    

    This will provide a DataFrame with 110 rows, covering the full table: there are 10 rows per page and 11 pages.
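
    One refinement worth considering: the time.sleep(2) after each click can be replaced by waiting for the old rows to go stale, which is faster and more robust. A sketch, assuming the table redraw detaches the previous tbody rows (imports as in the block above):

    # grab a reference to the current first data row, click "next",
    # then wait until that row is detached from the DOM (i.e. the
    # table has been redrawn with the next page's data)
    first_row = driver.find_element(By.CSS_SELECTOR, '#tableMercados tbody tr')
    next_button.click()
    WebDriverWait(driver, 10).until(EC.staleness_of(first_row))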

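    Also note that every scraped value lands in the DataFrame as a string. If you plan to do numeric work with the result, strip the comma thousands separators and convert; a small sketch (column names taken from the printed header row above):

    numeric_cols = ['ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO', 'MÍNIMO',
                    'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']
    for col in numeric_cols:
        # remove thousands separators, then coerce the strings to numbers
        df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')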
