
I really need help with this. I have to read thousands of HTML files and extract values into a dataset, but the HTML files contain JavaScript, and pandas returns the name of the variable instead of its value. The code I use is very simple: I read the file with pandas and then look through the tables it finds.
I will share a link to the HTML file, the code, and an image of the value I am trying to get.
Thank you.

url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"
import pandas as pd

df_mill = pd.read_html(url)
print(df_mill[2])

I get "0 verificar tempo" when I should get "Tempo Total: 03:09:09", as shown in the image.

enter image description here

2 Answers


  1. First of all, I don't believe this is the best way to solve the problem. I used the Selenium web driver to execute the JavaScript. Since the data is hidden behind an XML file, I converted it to HTML and then opened the HTML object with Selenium, which lets the browser execute the JavaScript. For the driver I used undetected_chromedriver, since it was easier for me to install.

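    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    import undetected_chromedriver as uc
    from selenium.webdriver.common.by import By

    url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"

    # Download the file and parse it so it can be handed to the browser as HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml")

    # Load the markup through a data: URL so the browser executes the JavaScript
    driver = uc.Chrome(use_subprocess=True)
    driver.get("data:text/html;charset=utf-8," + str(soup))

    # Element with the id "pdTime" (located here, but not used below)
    table_data = driver.find_element(By.ID, "pdTime")

    # Parse the tables from the rendered page; StringIO avoids the pandas
    # deprecation warning for passing literal HTML strings
    df = pd.read_html(StringIO(driver.page_source))

    print(df[1])
    # Close the browser
    driver.quit()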
    

    Don’t forget to install selenium.
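
    One caveat with the data: URL used above: the markup is concatenated unescaped, so a character such as "#" in the page may be read as the start of a URL fragment and cut the document short. A minimal sketch of one way around this, assuming percent-encoding the markup is acceptable; the helper name to_data_url and the sample markup are made up for illustration:

```python
from urllib.parse import quote, unquote

DATA_URL_PREFIX = "data:text/html;charset=utf-8,"


def to_data_url(html: str) -> str:
    """Return a data: URL that carries the given HTML, percent-encoded."""
    # safe="" encodes every reserved character, including "#", "%" and "?"
    return DATA_URL_PREFIX + quote(html, safe="")


if __name__ == "__main__":
    sample = '<p id="pdTime">a#b 100%</p>'  # made-up markup for illustration
    url = to_data_url(sample)
    print(url)
    # Decoding the payload gives back the original markup
    print(unquote(url[len(DATA_URL_PREFIX):]) == sample)
```

    With this helper, the call in the snippet above would become driver.get(to_data_url(str(soup))).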

  2. "but the HTML file has JavaScript and pandas is getting me the name of the variable instead of the value."

    That means… you need to actually execute the JavaScript part in order to get those values.
    While pandas is great for handling structured data, such as tables in an HTML page, it does not execute JavaScript. It therefore cannot retrieve values that are set by JavaScript.
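
    To make the point concrete, here is a small sketch that uses only the standard library. The markup is invented for illustration (it is not the contents of Project_Summary.html): a table cell holds placeholder text and a script would overwrite it in a browser. A static parser, like the one behind pandas.read_html, only ever sees the placeholder.

```python
from html.parser import HTMLParser

# Made-up markup: the cell holds placeholder text that a script would replace
SAMPLE_HTML = """
<table><tr><td id="pdTime">verificar tempo</td></tr></table>
<script>
  document.getElementById("pdTime").textContent = "Tempo Total: 03:09:09";
</script>
"""


class CellText(HTMLParser):
    """Collect the text of every <td> without running any script."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Script bodies also arrive here, but never while inside a <td>
        if self.in_cell:
            self.cells[-1] += data


def static_cells(html):
    """Return the stripped text of each table cell as a static parser sees it."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


if __name__ == "__main__":
    print(static_cells(SAMPLE_HTML))  # ['verificar tempo']
```

    The script is present in the input, but nothing executes it, so the value it would write never appears in the parsed cells.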

    One possible approach is to use selenium 4.9.1, the Python language bindings for Selenium WebDriver, a library that lets you automate browser actions.
    Selenium will actually load the webpage in a real browser, execute any JavaScript, and then allow you to access the resulting DOM (Document Object Model), including any modifications made by JavaScript.

    The following is an example of how you might use Selenium to retrieve the value you’re looking for:

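    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    
    # path to the chromedriver executable
    chromedriver_path = '/path/to/chromedriver'
    
    # create a new browser session; Selenium 4 takes the driver path through a
    # Service object (the executable_path argument of Chrome() is deprecated)
    driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path))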
    
    # direct the driver to the webpage
    url = "https://raw.githubusercontent.com/OperationsMD/powermill/main/Project_Summary.html"
    driver.get(url)
    
    # get the value of a JavaScript variable
    # (this assumes the page defines a global variable named totalTime)
    total_time = driver.execute_script("return totalTime;")
    
    # print the value
    print(total_time)
    
    # end the Selenium browser session
    driver.quit()
    

    This code starts a new browser session, navigates to the desired webpage, and then retrieves the value of the totalTime JavaScript variable.

    Replace '/path/to/chromedriver' with the path where you installed ChromeDriver. You can download ChromeDriver from its official site.
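
    If the value is written into the page by a script, it may not be there yet at the moment driver.get returns. A rough sketch of a polling helper follows, assuming the value ends up in the element with id "pdTime" (the id used in the other answer) and that "verificar tempo" is the placeholder text; both are assumptions about the page, and wait_for_text is a made-up name. Selenium ships its own WebDriverWait for this purpose; the hand-written loop here only keeps the example free of extra imports.

```python
import time


def wait_for_text(driver, element_id="pdTime", placeholder="verificar tempo",
                  timeout=10.0, poll=0.25):
    """Poll an element until its text is non-empty and differs from the placeholder."""
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the locator strategy string that selenium's By.ID stands for
        text = driver.find_element("id", element_id).text.strip()
        if text and text != placeholder:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"#{element_id} still shows {text!r} after {timeout}s"
            )
        time.sleep(poll)
```

    After driver.get(...), calling wait_for_text(driver) would return the element's text once the script has replaced the placeholder, or raise TimeoutError if it never does.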

    That is a valid approach for one page. For thousands of pages you might run into resource issues, so test it incrementally! (And do it in a way that does not look like a DDoS attack 😉 Do not send too many requests in too little time.)
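
    One way to test incrementally is to run the extraction over a small slice first and pause between pages. The sketch below is only an outline under stated assumptions: extract_all is a made-up helper, and extract stands for whatever function returns the value for one page (for example, a wrapper around a single reused Selenium driver).

```python
import time


def extract_all(sources, extract, delay=0.5, limit=None):
    """Apply extract() to each source in turn, pausing between calls.

    sources: iterable of file paths or URLs
    extract: callable that returns the value for one source (hypothetical)
    delay:   seconds to sleep between two consecutive calls
    limit:   process at most this many sources (useful for a trial run)
    """
    results = {}
    for index, source in enumerate(sources):
        if limit is not None and index >= limit:
            break
        if index > 0:
            time.sleep(delay)  # throttle so the server is not flooded
        try:
            results[source] = extract(source)
        except Exception as exc:
            # Record the failure and carry on with the remaining sources
            results[source] = None
            print(f"skipped {source}: {exc}")
    return results
```

    Starting with something like limit=10 shows how long one page takes and how often extraction fails before committing to the full run. The resulting dictionary can then be turned into a dataset, for instance with pd.DataFrame(list(results.items()), columns=["file", "total_time"]).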
