Can an HTML file be written with JavaScript and Python?

Dini
May 20, 2023
197 views
0 votes
2 Answers

I really need help with this. I have to read thousands of HTML files to extract values into a dataset, but the HTML file contains JavaScript, and pandas is giving me the name of the variable instead of the value. The code I have used is very simple: use pandas to read the file and then look through the tables it found.
I will share a link to the HTML file, the code, and an image of the value I am trying to get.
Thank you.

url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"
import pandas as pd

df_mill = pd.read_html(url)
print(df_mill[2])

And I get: 0 verificar tempo, when I should get: Tempo Total: 03:09:09, as shown in the image.

Answers

- Cacci
- May 20, 2023 at 9:43 pm
- 0 votes
First of all, I believe this is not the best way to solve this problem. I used Selenium WebDriver to execute the JavaScript. Since the data is hidden behind an XML file, I converted it to HTML and then opened the HTML object with Selenium. This allowed the browser to execute the JavaScript. For the driver I used undetected_chromedriver, since it was easier for me to install.
```
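from io import StringIO
from urllib.parse import quote

import pandas as pd
import requests
from bs4 import BeautifulSoup
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"

# Download the raw file and parse it into an HTML document
response = requests.get(url)
soup_obj = BeautifulSoup(response.content, "lxml")

# Open the HTML through a data: URL so the browser runs its JavaScript
# (quote() escapes characters such as '#' that would otherwise cut the URL short)
driver = uc.Chrome(use_subprocess=True)
driver.get("data:text/html;charset=utf-8," + quote(str(soup_obj)))

# Check that the pdTime element exists (raises NoSuchElementException otherwise)
driver.find_element(By.ID, "pdTime")

# Read the tables from the rendered page
# (wrapping the source in StringIO avoids the deprecation warning in pandas 2.1+)
df = pd.read_html(StringIO(driver.page_source))
print(df[1])

# Close the browser
driver.quit()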
```
Don’t forget to install selenium.

- VonC
- May 20, 2023 at 9:46 pm
- 0 votes
"but the HTML file contains JavaScript, and pandas is giving me the name of the variable instead of the value."

That means… you need to actually execute the JavaScript part in order to get those values.
While Pandas is great for handling structured data, like tables in an HTML page, it does not execute JavaScript code. Therefore, it will not be able to retrieve the values set by JavaScript.
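
To illustrate, here is a minimal sketch that uses only Python's standard-library html.parser as a stand-in for any static parser; pandas reads the markup in the same static way. The HTML snippet and its script are made up for illustration (the pdTime id is borrowed from the other answer), so they are assumptions and not the contents of the actual file.

```python
from html.parser import HTMLParser

# Hypothetical page: the cell holds placeholder text that a script
# would overwrite once a real browser runs it.
SAMPLE_HTML = """
<table>
  <tr><td id="pdTime">verificar tempo</td></tr>
</table>
<script>
  document.getElementById("pdTime").textContent = "Tempo Total: 03:09:09";
</script>
"""


class CellText(HTMLParser):
    """Collect the text found inside <td> elements and ignore the rest."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Script source also arrives here, but it is outside any <td>
        if self.in_cell:
            self.cells[-1] += data


parser = CellText()
parser.feed(SAMPLE_HTML)
static_cells = [cell.strip() for cell in parser.cells]

# The parser never runs the script, so only the placeholder is visible
print(static_cells)
```

The script is just text to the parser, so the cell still contains the placeholder. Only something that runs the JavaScript, such as a browser, will see the replaced value.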

One possible approach would be to use selenium 4.9.1, the Python language bindings for Selenium WebDriver: a Python library that allows you to automate browser actions.
Selenium will actually load the webpage in a real browser, execute any JavaScript, and then allow you to access the resulting DOM (Document Object Model), including any modifications made by JavaScript.

The following is an example of how you might use Selenium to retrieve the value you’re looking for:
```
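from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# path to the chromedriver executable
chromedriver_path = '/path/to/chromedriver'

# create a new browser session
# (Selenium 4 takes the driver path through a Service object;
# the executable_path argument is deprecated)
driver = webdriver.Chrome(service=Service(chromedriver_path))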

# direct the driver to the webpage
url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"
driver.get(url)

# get the value of a JavaScript variable
total_time = driver.execute_script("return totalTime;")

# print the value
print(total_time)

# end the Selenium browser session
driver.quit()
```
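This code starts a new browser session, navigates to the desired webpage, and then retrieves the value of the totalTime JavaScript variable. If the page does not define a global variable with that name, look in its script for the actual one. Note also that raw.githubusercontent.com serves files as plain text, so the browser may show the source instead of running it; in that case, open a local copy of the file (a file:// URL) instead.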

Do replace '/path/to/chromedriver' with the path where you've installed ChromeDriver. You can download ChromeDriver from its official download page.

That is a valid approach for one page. For thousands of pages, you might encounter resource issues, so test it incrementally! (And in a way that does not look like a DDoS attack 😉 Do not send too many requests in too little time.)
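
As a rough sketch of that incremental, throttled approach: the helper below walks through a list of sources one at a time, pauses between them, and records failures instead of stopping. The name extract_all, its parameters, and the one-second default pause are my own assumptions for illustration, not part of Selenium or pandas.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def extract_all(
    sources: Iterable[str],
    extract: Callable[[str], str],
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[str]]:
    """Apply extract() to each source, pausing between consecutive sources.

    A source whose extraction raises is recorded as None so that one bad
    file does not abort the whole run.
    """
    results: Dict[str, Optional[str]] = {}
    for index, source in enumerate(sources):
        if index > 0:
            sleep(pause_seconds)  # throttle: no pause before the first source
        try:
            results[source] = extract(source)
        except Exception as error:
            print(f"skipped {source}: {error}")
            results[source] = None
    return results
```

With Selenium, extract would be a small function that calls driver.get(...) on one shared driver and returns the value it reads, so the browser is started once rather than once per file. Start with a handful of files, check the results, and only then increase the number.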