
I’m using the Python Chrome WebDriver to extract all the text from the description section of a geocaching website (Here’s a sample website if anyone wants to take a look). The text is stored in several <p> elements inside one <span> element. I cannot figure out how to take all the <p> elements and save their text as one string separated by spaces.

I tried both of the solutions below. The first one only output the text from the first <p> element. The second one sometimes output only the first element and sometimes more (but never all). I couldn’t figure out why the second one is inconsistent in the number of elements it returns.

# Attempt 1: collect the <p> elements inside the description <span>
desc_span = driver.find_element(By.XPATH, '/html/body/form[1]/main/div/div/div[2]/div[9]/span')
p_elements = desc_span.find_elements(By.TAG_NAME, 'p')
desc = ' '.join(p_element.text for p_element in p_elements)
print(desc)

# Attempt 2: collect every direct child of the description <div>
desc_div = driver.find_element(By.XPATH, '/html/body/form[1]/main/div/div/div[2]/div[9]')
all_elements = desc_div.find_elements(By.XPATH, '*')
desc = ' '.join(element.text for element in all_elements)
print(desc)

2

Answers


  1. I think you should wait until all elements matched by the selector are visible, and probably change the selector as well.

    Try the code below:

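    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Wait up to 10 seconds for every matching <p> to become visible
    wait = WebDriverWait(driver, 10)
    p_elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.UserSuppliedContent p')))
    desc = ' '.join(p_element.text for p_element in p_elements)
    print(desc)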
    

    Using your link (I am not logged in), the output is:

    It's a "W" Thang. This is my first cache that I have submitted. Placed with permission. Magnetic

    This corresponds to all the <p> tags in the description section.
    So you’re on the right track; you just need to wait until all the elements are rendered.
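
To make the idea of waiting more concrete, here is a rough, Selenium-free sketch of what an explicit wait does conceptually: it calls a condition repeatedly until the condition returns something truthy or a timeout expires. The names `wait_until` and `all_paragraphs_rendered` are hypothetical and exist only for this illustration; this is not how WebDriverWait is implemented internally, just the general polling pattern under the assumption that the page fills in its <p> elements over time.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Hypothetical helper: poll condition() until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        time.sleep(poll_interval)


# Simulated page: one more paragraph becomes available on each check.
pending = ["first paragraph", "second paragraph"]
rendered = []


def all_paragraphs_rendered():
    """Return the full list once nothing is pending, otherwise False."""
    if pending:
        rendered.append(pending.pop(0))
    return list(rendered) if not pending else False


paragraphs = wait_until(all_paragraphs_rendered, timeout=2.0, poll_interval=0.01)
print(" ".join(paragraphs))
```

Reading the elements once, without polling, is like calling the condition a single time: you get whatever happens to be rendered at that moment, which would explain a result that varies from run to run.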

  2. The desired texts are within <p> tags which have an ancestor <div class="UserSuppliedContent">


    Solution

    To extract all the text from the description section of the geocaching website into a list, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following CSS selector as the locator strategy:

    driver.get(url='https://www.geocaching.com/geocache/GC4ZJ9R')
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.UserSuppliedContent p")))])
    

    Console Output:

    ['It's a "W" Thang. This is my first cache that I have submitted. Placed with permission.', 'Magnetic']
    

    Further, if you want to take all the <p> elements and save them as one string separated by spaces, call join() on a single-space string:

    driver.get(url='https://www.geocaching.com/geocache/GC4ZJ9R')
    print(" ".join(my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.UserSuppliedContent p")))))


    Console Output:

    It's a "W" Thang. This is my first cache that I have submitted. Placed with permission. Magnetic
    

    Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
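
As a small follow-up to the join() step, here is a hedged sketch of a helper that joins paragraph texts with single spaces while skipping empty strings. Selenium's `.text` returns an empty string for elements that are not visible, so empty entries can show up in the list. The function name `join_paragraphs` is hypothetical, and the input list simply stands in for `[elem.text for elem in p_elements]`.

```python
def join_paragraphs(texts, sep=" "):
    """Hypothetical helper: join paragraph strings, skipping empty ones.

    Runs of whitespace (including newlines) inside each paragraph are
    collapsed to a single space before joining.
    """
    cleaned = (" ".join(text.split()) for text in texts)
    return sep.join(text for text in cleaned if text)


# Stand-in for [elem.text for elem in p_elements]
texts = [
    'It\'s a "W" Thang. This is my first cache that I have submitted. Placed with permission.',
    "",
    "  Magnetic\n",
]
desc = join_paragraphs(texts)
print(desc)
```

Without the empty-string filter, a plain `" ".join(texts)` would leave doubled spaces wherever an element had no visible text.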
    