I am trying to fetch, as rows, the different values inside each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales
There should be one row, matching each field in the provided headers, for each different href element in the HTML file.
This is one of the portions of the HTML that I am trying to scrape:
<tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">191.04
</span></td><td>191.32</td>
<td>194.51</td>
<td>193.92</td>
<td>191.01</td>
<td>380,544</td>
<td>73,122,008.42</td>
<td>2,793</td>
<td>-3.19</td><td>-1.64</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a>
</td>
<td><span class="series">B</span>
</td><td>03:20</td><td>
<span class="">22.5</span></td><td>0</td>
<td>22.5</td><td>0</td><td>0
</td><td>3</td><td>67.20</td>
<td>1</td><td>0</td><td>0</td></tr>
<tr role="row" class="odd">
<td class="sorting_1">
<a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
<td><span class="series">B</span></td><td>03:20</td><td>
<span class="">15.13</span></td><td>0</td><td>15.13</td><td>0</td>
<td>0</td><td>13</td><td>196.69</td><td>4</td><td>0</td>
<td>0</td></tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/339083">AGUA</a></td>
<td><span class="series">*</span>
</td><td>03:20</td><td>
<span class="color-1">29</span>
</td><td>28.98</td><td>28.09</td>
<td>29</td><td>28</td><td>296,871</td>
<td>8,491,144.74</td><td>2,104</td><td>0.89</td>
<td>3.17</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/30">ALFA</a></td><td><span class="series">A</span></td>
<td>03:20</td>
<td><span class="color-2">13.48</span>
</td><td>13.46</td>
<td>13.53</td><td>13.62</td><td>13.32</td>
<td>2,706,398</td>
<td>36,494,913.42</td><td>7,206</td><td>-0.07</td>
<td>-0.52</td>
</tr><tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/7684">ALPEK</a></td><td><span class="series">A</span>
</td><td>03:20</td><td><span class="color-2">10.65</span>
</td><td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td>
<td>1,284,847</td><td>13,729,368.46</td><td>6,025</td><td>-0.34</td>
<td>-3.10</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1729">ALSEA</a></td><td><span class="series">*</span>
</td><td>03:20</td><td><span class="color-2">65.08</span></td><td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td><td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td></tr>
<tr role="row" class="even"><td class="sorting_1">
<a href="/es/mercados/cotizacion/424518">ALTERNA</a></td><td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td><td>0</td><td>1.5</td>
<td>0</td><td>0</td><td>2</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr role="row" class="odd"><td class="sorting_1">
<a href="/es/mercados/cotizacion/1862">AMX</a></td>
<td><span class="series">B</span></td><td>03:20</td>
<td><span class="color-2">14.56</span></td><td>14.58</td>
<td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
<td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td>
<td>-0.75</td></tr><tr role="row" class="even">
<td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a>
</td><td><span class="series">10</span></td><td>03:20</td><td>
<span class="color-2">21.09</span>
</td><td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td>
<td>51,005</td><td>1,076,281.67</td>
<td>22</td><td>-0.34</td><td>-1.59</td></tr>
</tbody>
And my current code results in an empty dataframe:
# create empty pandas dataframe
import pandas as pd
import requests
from bs4 import BeautifulSoup

# get the response from the webhost
page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
soup = BeautifulSoup(page.text, 'lxml')
#print(soup.p.text)
# yet it doesn't bring the expected rows!
print('Read html!')

# get headers
thead = soup.find("thead")
tr = thead.find_all("tr")
headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]
#print(headers)
df = pd.DataFrame(columns=headers)

# fetch rows into the pandas dataframe
# You can find children with multiple tags by passing a list of strings
rows = soup.find_all('tr', {"role": "row"})
#rows
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        #print("The value in this cell is %s" % value)
    # append row in dataframe
I would like to know if it is possible to get a pandas dataframe whose fields are the ones portrayed in the headers list and whose rows correspond to each href element.
For better perspective, the expected output should be equal to the table at the bottom of the provided website, whose first row has the following schema:
EMISORA SERIE HORA ÚLTIMO PPP ANTERIOR MÁXIMO MÍNIMO VOLUMEN IMPORTE OPS. VAR PUNTOS VAR %
AC * 3:20 191.04 191.32 194.51 193.92 191.01 380,544 73,122,008.42 2,793 -3.19 -1.64
Is it possible to create such a dataset?
2 Answers
You can use BeautifulSoup to parse the HTML and extract the necessary information from the href attributes. Then, construct a pandas DataFrame using this information. Try this:
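A minimal sketch of that idea, run against the static snippet from the question (the header names are taken from the expected output shown there; note that fetching the live page with plain requests may not return these rows, as the other answer explains):

```python
import pandas as pd
from bs4 import BeautifulSoup

# the first data row from the snippet in the question
html = """
<tbody>
<tr role="row" class="odd">
<td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC</a></td>
<td><span class="series">*</span></td><td>03:20</td>
<td><span class="color-2">191.04</span></td><td>191.32</td>
<td>194.51</td><td>193.92</td><td>191.01</td><td>380,544</td>
<td>73,122,008.42</td><td>2,793</td><td>-3.19</td><td>-1.64</td></tr>
</tbody>
"""

# column names copied from the expected output in the question
headers = ['EMISORA', 'SERIE', 'HORA', 'ÚLTIMO', 'PPP', 'ANTERIOR',
           'MÁXIMO', 'MÍNIMO', 'VOLUMEN', 'IMPORTE', 'OPS.',
           'VAR PUNTOS', 'VAR %']

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find_all('tr', {'role': 'row'}):
    # get_text collapses the nested <a>/<span> tags inside each <td>
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) == len(headers):
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
print(df)
```

The ticker name (and its href) lives in the `<a>` tag of the first cell, so `get_text` on the whole `<td>` already yields the EMISORA column.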
The website you provided uses Javascript to load content dynamically, meaning BeautifulSoup will receive HTML in its response but will be missing significant pieces. For this reason, you will have to integrate your BeautifulSoup code with a more sophisticated tool like Selenium.
Most of your code is correct, but it works with only a subset of the data shown on the page due to the dynamic rendering. I have adjusted your code to include Selenium; this will navigate to the website, wait for the Javascript to load, scrape the newly rendered data, and turn the table into a pandas DataFrame. Printing the DataFrame shows data that matches the table on the website.
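A sketch of such an adjusted version, assuming Chrome, the Selenium 4 style API, and that the rendered table keeps the `tr[role="row"]` markup shown in the question (verify the selector against the live page):

```python
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://www.bmv.com.mx/es/mercados/capitales'

def table_to_df(html):
    """Parse the rendered HTML into a DataFrame, reusing the question's markup."""
    soup = BeautifulSoup(html, 'html.parser')
    headers = [th.get_text(strip=True)
               for th in soup.find('thead').find_all('th')]
    data = []
    for tr in soup.find_all('tr', {'role': 'row'}):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == len(headers):   # skip rows without data cells
            data.append(cells)
    return pd.DataFrame(data, columns=headers)

def scrape():
    # Selenium imported lazily so table_to_df() works without a browser installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()          # assumes chromedriver is on PATH
    try:
        driver.get(URL)
        # wait for the Javascript-rendered rows to appear
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, 'tr[role="row"] td')))
        return table_to_df(driver.page_source)
    finally:
        driver.quit()

# df = scrape()   # requires a browser; returns the first page of the table
# print(df)
```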
EDIT:
If you wish to scrape all of the data that the website provides, you will need to click the "next" button using Selenium; BeautifulSoup unfortunately does not provide the ability to click buttons. (NOTE: sometimes you can scrape different URLs to solve this issue when clicking "next" loads a different URL, but clicking "next" on this page renders new data with Javascript instead of changing the URL, meaning Selenium is doubly needed.)
Here is the Selenium code to scrape all of the data provided by the site:
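A sketch of that pagination loop, with the same caveats as above; the "next" button selector (a DataTables-style paginate button, which the `sorting_1`/`odd`/`even` classes suggest this table uses) is an assumption and must be checked against the live page:

```python
import time
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://www.bmv.com.mx/es/mercados/capitales'
N_PAGES = 11   # the table shows 10 rows per page across 11 pages

def rows_from(html):
    """Extract the <td> text of every data row on the current page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [[td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in soup.find_all('tr', {'role': 'row'}) if tr.find('td')]

def scrape_all_pages():
    # Selenium imported lazily so rows_from() works without a browser installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()          # assumes chromedriver is on PATH
    try:
        driver.get(URL)
        time.sleep(5)                    # crude wait for the Javascript render
        headers = [th.text.strip() for th in
                   driver.find_elements(By.CSS_SELECTOR, 'thead th')]
        data = []
        for page in range(N_PAGES):
            data.extend(rows_from(driver.page_source))
            if page < N_PAGES - 1:
                # assumed DataTables-style "next" button; verify this selector
                driver.find_element(
                    By.CSS_SELECTOR, 'a.paginate_button.next').click()
                time.sleep(2)            # let the new rows render
        return pd.DataFrame(data, columns=headers)
    finally:
        driver.quit()

# df = scrape_all_pages()   # requires a browser; yields all pages of the table
```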
This will provide a DataFrame with 110 rows: 10 rows per page across 11 pages.