I really need help with this. I have to read thousands of HTML files to extract values into a dataset, but the HTML file contains JavaScript, and pandas gives me the name of the variable instead of its value. The code I have used is very simple: use pandas to read the file and then look through the tables it found.
I will share a link to the HTML file, the code, and an image of the value I am trying to get.
Thank you.
url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"
import pandas as pd
df_mill = pd.read_html(url)
print(df_mill[2])
And I get "0 verificar tempo" when I should get "Tempo Total: 03:09:09", the value shown in the image.
2 Answers
First of all, I believe this is not the best way to solve this problem. I used the Selenium web driver to execute the JavaScript. Since the data is hidden behind an XML file, I converted it to HTML and then opened the HTML object with Selenium, which allowed the browser to execute the JavaScript. For the driver I used undetected_chromedriver, since it was easier for me to install.
Don’t forget to install selenium.
That means… you need to actually execute the JavaScript part in order to get those values.
While Pandas is great for handling structured data, like tables in an HTML page, it does not execute JavaScript code. Therefore, it will not be able to retrieve the values set by JavaScript.
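To make that concrete, here is a small sketch using only the standard library. The markup in SAMPLE_HTML is a hypothetical stand-in, not the actual Project_Summary.html, and CellCollector and static_cells are illustrative names. It shows that a parser that never runs the script sees only the placeholder text in the table cell, which is the same situation pandas is in.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the real page: the cell holds placeholder text,
# and a script would overwrite it once a browser runs the JavaScript.
SAMPLE_HTML = """
<table>
  <tr><td id="tempo">verificar tempo</td></tr>
</table>
<script>
  var totalTime = "03:09:09";
  document.getElementById("tempo").textContent = "Tempo Total: " + totalTime;
</script>
"""


class CellCollector(HTMLParser):
    """Collects the text found inside <td> elements, without running scripts."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Script contents also arrive here, but they are outside any <td>.
        if self.in_cell and data.strip():
            self.cells.append(data.strip())


def static_cells(html):
    """Return the table cell texts as a static parser sees them."""
    collector = CellCollector()
    collector.feed(html)
    return collector.cells


print(static_cells(SAMPLE_HTML))  # ['verificar tempo']
```

The script is never executed, so the cell keeps its placeholder and the real value never appears.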
One possible approach would be to use selenium 4.9.1, the Python language bindings for Selenium WebDriver: a Python library that allows you to automate browser actions.
Selenium will actually load the webpage in a real browser, execute any JavaScript, and then allow you to access the resulting DOM (Document Object Model), including any modifications made by JavaScript.
The following is an example of how you might use Selenium to retrieve the value you’re looking for:
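A minimal sketch of what such a script could look like, assuming Selenium 4 and assuming that totalTime is a global variable in the page. The helper names make_driver, read_total_time, and main are illustrative, and the ChromeDriver path is a placeholder.

```python
def make_driver(chromedriver_path="/path/to/chromedriver"):
    """Start a Chrome session (requires `pip install selenium` and ChromeDriver)."""
    # Imported lazily so read_total_time can be tried out without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(chromedriver_path))


def read_total_time(driver, url):
    """Load the page so its JavaScript runs, then return the totalTime variable."""
    driver.get(url)
    # Assumption: the page defines `totalTime` as a global JavaScript variable.
    return driver.execute_script("return totalTime;")


def main():
    url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"
    driver = make_driver()
    try:
        print(read_total_time(driver, url))
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

Call main() to run it against the page from the question.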
This code starts a new browser session, navigates to the desired webpage, and then retrieves the value of the totalTime JavaScript variable.
Replace '/path/to/chromedriver' with the path where you've installed ChromeDriver, which you can download from its official download page.
That is a valid approach for one page. For thousands, you might encounter resource issues, so test it incrementally! (And in a way that does not look like a DDoS attack 😉: do not send too many requests in too little time.)
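One way to scale this up is to reuse a single browser session and pause between pages. This is only a rough sketch: scrape_many, delay_seconds, and the sleep parameter are illustrative names, the one-second default is an arbitrary choice, and it again assumes a global totalTime variable in each page.

```python
import time


def scrape_many(driver, urls, delay_seconds=1.0, sleep=time.sleep):
    """Visit each URL with one shared driver, pausing between requests.

    Returns a dict mapping each URL to its totalTime value, or None on failure.
    """
    results = {}
    for index, url in enumerate(urls):
        if index:
            # Throttle: wait before every request except the first.
            sleep(delay_seconds)
        try:
            driver.get(url)
            # Assumption: each page exposes a global `totalTime` variable.
            results[url] = driver.execute_script("return totalTime;")
        except Exception as exc:
            # Record the failure and keep going so one bad page
            # does not abort the whole batch.
            results[url] = None
            print(f"failed on {url}: {exc}")
    return results
```

To test incrementally, try it on a small slice first, for example scrape_many(driver, urls[:10]), and only then run the full list.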