
I am trying to fetch, as rows, the values associated with each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales

There should be one row, with one value per header field, for each distinct href element in the HTML file.

This is one of the portions of the HTML that I am trying to scrape:


<tbody>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">191.04</span></td><td>191.32</td><td>194.51</td><td>193.92</td><td>191.01</td><td>380,544</td><td>73,122,008.42</td><td>2,793</td><td>-3.19</td><td>-1.64</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">22.5</span></td><td>0</td><td>22.5</td><td>0</td><td>0</td><td>3</td><td>67.20</td><td>1</td><td>0</td><td>0</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/6096">ACTINVR</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td><td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td><td>0</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/339083">AGUA</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-1">29</span></td><td>28.98</td><td>28.09</td><td>29</td><td>28</td><td>296,871</td><td>8,491,144.74</td><td>2,104</td><td>0.89</td><td>3.17</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td><td>03:20</td><td><span class="color-2">13.48</span></td><td>13.46</td><td>13.53</td><td>13.62</td><td>13.32</td><td>2,706,398</td><td>36,494,913.42</td><td>7,206</td><td>-0.07</td><td>-0.52</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span></td><td>03:20</td><td><span class="color-2">10.65</span></td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td><td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td><td>-3.10</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td><td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td>
  </tr>
  <tr role="row" class="odd">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/1862">AMX</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="color-2">14.56</span></td><td>14.58</td><td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td><td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td><td>-0.75</td>
  </tr>
  <tr role="row" class="even">
    <td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a></td><td><span class="series">10</span></td><td>03:20</td><td><span class="color-2">21.09</span></td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td><td>51,005</td><td>1,076,281.67</td><td>22</td><td>-0.34</td><td>-1.59</td>
  </tr>
</tbody>

And my current code results in an empty DataFrame:

import pandas as pd
import requests
from bs4 import BeautifulSoup


# get the page from the webhost
page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
soup = BeautifulSoup(page.text, 'lxml')
#print(soup.p.text)
# yet it doesn't bring the expected rows!

print('Read html!')

# get the headers from the table head
thead = soup.find("thead")
tr = thead.find_all("tr")

headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]
#print(headers)

# create an empty pandas DataFrame with those columns
df = pd.DataFrame(columns=headers)

# fetch the data rows: every <tr> with role="row"
rows = soup.find_all('tr', {"role": "row"})

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        #print("The value in this cell is %s" % value)

        # append row to dataframe

I would like to know if it is possible to get a pandas DataFrame whose fields are
the ones in the headers list and whose rows correspond to each href element.

For better perspective, the expected output should match the table at the bottom of the provided website, whose first row has the following schema:

EMISORA  SERIE  HORA   ÚLTIMO  PPP     ANTERIOR  MÁXIMO  MÍNIMO  VOLUMEN  IMPORTE        OPS.   VAR PUNTOS  VAR %
AC       *      03:20  191.04  191.32  194.51    193.92  191.01  380,544  73,122,008.42  2,793  -3.19       -1.64

Is it possible to create such a dataset?

2 Answers


  1. You can use BeautifulSoup to parse the HTML and extract the necessary information from the href attributes. Then, construct a pandas DataFrame using this information.

    Try this:

    # Collected rows will go here; headers comes from your own header-parsing code
    data = []

    # Find all rows
    rows = soup.find_all('tr')

    # Iterate over each row
    for row in rows:
        # Find the anchor tag within the row
        anchor = row.find('a')
        if anchor:
            # Extract href and text content
            href = anchor['href']
            text = anchor.text.strip()

            # Find all cells in the row; skip the first one, which holds the anchor
            cells = row.find_all('td')
            cell_values = [cell.text.strip() for cell in cells[1:]]

            # Combine all values into a single row
            row_data = [text, *cell_values]

            # Append the row data to the main data list
            data.append(row_data)

    # Create a DataFrame
    df = pd.DataFrame(data, columns=headers)
    
  2. The website you provided uses JavaScript to load content dynamically, meaning BeautifulSoup will receive HTML in its response but will be missing significant pieces. For this reason, you will have to integrate your BeautifulSoup code with a more sophisticated tool like Selenium.
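
    A quick way to confirm this is to inspect what requests actually receives. A minimal sketch, assuming the same tableMercados id used in the code below (the static response may carry an empty table shell, or no table at all):

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
    soup = BeautifulSoup(page.text, 'html.parser')

    # the data rows are filled in by JavaScript, so the static HTML
    # either lacks the table or ships it with an empty tbody
    table = soup.find('table', id='tableMercados')
    print(table.find('tbody') if table else 'table not in static HTML')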

    Most of your code is correct, but it works with only a subset of the data shown on the page because of the dynamic rendering. I have adjusted your code to include Selenium; it navigates to the website, waits for the JavaScript to load, scrapes the newly rendered data, and turns the table into a pandas DataFrame:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time
    import pandas as pd
    
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)
    
    url = 'https://www.bmv.com.mx/es/mercados/capitales'
    
    driver.get(url)
    
    time.sleep(4) # wait for page to load
    
    html_content = driver.page_source
    driver.quit()
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    table = soup.find('table', id='tableMercados')
    
    # get column names
    columns = [th.get_text().strip() for th in table.find_all('th')]
    data = []
    rows = table.find('tbody').find_all('tr')
    
    for row in rows:
        row_data = [col.get_text().strip() for col in row.find_all('td')]
        data.append(row_data)
    
    df = pd.DataFrame(data, columns=columns)
    
    print(df)
    

    Printing the DataFrame shows the following data:

       EMISORA SERIE   HORA ÚLTIMO     PPP ANTERIOR  MÁXIMO  MÍNIMO     VOLUMEN           IMPORTE    OPS. VAR PUNTOS  VAR %
    0       AC     *  03:20  189.3  189.32   191.91  192.98  189.01     502,297     95,483,093.50   3,333      -2.59  -1.35
    1  ACCELSA     B  03:20   22.5       0     22.5       0       0           1             22.40       1          0      0
    2  ACTINVR     B  03:20  15.03       0       15   15.06   15.03      55,479        833,864.22      22       0.03   0.20
    3     AGUA     *  03:20  31.55   31.59     30.2   31.98    29.8     327,380     10,149,130.38   2,283       1.39   4.60
    4  ALEATIC     *  03:20   36.5       0     36.5       0       0           7            254.50       2          0      0
    5     ALFA     A  03:20  13.19   13.19    13.04    13.4   13.01   4,045,917     53,405,549.08  11,798       0.15   1.15
    6    ALPEK     A  03:20  10.84   10.82    10.73   10.93   10.67     369,472      3,999,136.42   4,451       0.09   0.84
    7    ALSEA     *  03:20  66.75   66.55    64.98   66.98   65.12   5,939,525    390,969,784.88   3,995       1.57   2.42
    8  ALTERNA     B  03:20    1.5       0      1.5       0       0          50             77.50       2          0      0
    9      AMX     B  03:20  15.27   15.29    15.01   15.38   15.01  96,328,174  1,472,868,239.86  48,894       0.28   1.87
    

    Which matches the data on the website.

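    As a side note, the fixed time.sleep(4) can be flaky on slow connections. An explicit wait on the table body is usually more reliable; a sketch using Selenium's WebDriverWait, with the same tableMercados id as above:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # wait up to 10 seconds for at least one data row to be rendered
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#tableMercados tbody tr'))
    )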

    EDIT:
    If you wish to scrape all of the data that the website provides, you will need to click the "next" button using Selenium; BeautifulSoup unfortunately does not provide the ability to click buttons. Here is an example doing just that.

    (NOTE: sometimes you can work around this by scraping different URLs, if clicking "next" loads a different URL; but clicking "next" on this page renders new data with JavaScript instead of changing the URL, so Selenium is doubly needed.)

    Here is the Selenium code to scrape all data provided by the site:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup
    import time
    import pandas as pd
    
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)
    
    url = 'https://www.bmv.com.mx/es/mercados/capitales'
    driver.get(url)
    
    wait = WebDriverWait(driver, 10)  # 10 second timeout
    wait.until(EC.presence_of_element_located((By.ID, "tableMercados_next")))
    
    all_data = []
    
    while True:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # re-locate the table on every iteration, since each page produces
        # a fresh soup; grab the column names only once, from the first page
        table = soup.find('table', id='tableMercados')

        if not all_data:
            columns = [th.get_text().strip() for th in table.find_all('th')]

        rows = table.find('tbody').find_all('tr')
        for row in rows:
            row_data = [col.get_text().strip() for col in row.find_all('td')]
            all_data.append(row_data)
    
        # check if there is a next page; the button is disabled if there isn't
        next_button = driver.find_element(By.ID, "tableMercados_next")
        if "disabled" in next_button.get_attribute("class"):
            print("Reached the last page.")
            break
        # load next page
        next_button.click()
        time.sleep(2)
    
    # create df from data
    df = pd.DataFrame(all_data, columns=columns)
    driver.quit()
    print(df)
    

    This will provide a DataFrame with 110 rows, covering the full table: there are 10 rows per page and 11 pages.
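
    One refinement worth considering: the time.sleep(2) after each click can be replaced by waiting for the old rows to go stale, which is faster and more robust. A sketch, assuming the table redraw detaches the previous tbody rows (imports as in the block above):

    # grab a reference to the current first data row, click "next",
    # then wait until that row is detached from the DOM (i.e. the
    # table has been redrawn with the next page's data)
    first_row = driver.find_element(By.CSS_SELECTOR, '#tableMercados tbody tr')
    next_button.click()
    WebDriverWait(driver, 10).until(EC.staleness_of(first_row))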

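    Also note that every scraped value lands in the DataFrame as a string. If you plan to do numeric work with the result, strip the comma thousands separators and convert; a small sketch (column names taken from the printed header row above):

    numeric_cols = ['ÚLTIMO', 'PPP', 'ANTERIOR', 'MÁXIMO', 'MÍNIMO',
                    'VOLUMEN', 'IMPORTE', 'OPS.', 'VAR PUNTOS', 'VAR %']
    for col in numeric_cols:
        # remove thousands separators, then coerce the strings to numbers
        df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')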
