I am trying to scrape the html tags for other uses, but it seems the arrangement of the html file prevents me using ordinary codes I can find to locate the elements
The element is rather long (there are 70+ options in the list), so I am showing only the first few items here:
<select title="Please select Automatic Weather Observations" id="aws" onchange="changeYearDropDown2()">
<option value="">===Manned Weather Station===</option>
<option value="HKO">Hong Kong Observatory</option>
<option value="HKA">Hong Kong International Airport</option>
<option value=""></option>
<option value="">===Automatic Weather Station===</option>
<option value="BR1">Beas River</option>
<option value="BHD">Bluff Head</option>
<option value="CP1">Central Pier</option>
<option value="CCH">Cheung Chau</option>
<option value="CPH">Ching Pak House(Tsing Yi)</option>
What I actually want are the values inside the tag AND the full name of stations as two separate lists. And since there are also some ‘blank’ tags in the html, I just don’t know how to get rid of them and extract only those with values and names…
Any help is appreciated! Thanks a lot in advance!!
Here are some methods I tried, and I don’t know why the find_all or find_elements fail:
Using BeautifulSoup:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.hko.gov.hk/en/cis/climat.htm'
req = requests.get(url)
soup = bs(req.text, 'html.parser')
#method 1
print(soup.select) #it prints the whole document for me
print(soup.option) #empty list
#method 2
titles = soup.find_all('option')
print(titles) #empty list
Using Selenium:
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
from selenium.webdriver.common.by import By
service = Service(executable_path="path/chromedriver_win32/chromedriver.exe")
#initialize web driver
with webdriver.Chrome(service=service) as driver:
#navigate to the url
driver.get(url)
#method 3: find element by CSS selector
myDiv = driver.find_element(By.CSS_SELECTOR, "select option") #with or without '>' returns the same output, and find_elements ALWAYS gets me an error
print(myDiv.get_attribute("outerHTML"))
print(myDiv.get_attribute("innerHTML")) #exc <select></select> results
#method 4: find element by tag name
myDiv2 = driver.find_element(By.TAG_NAME, 'select')
print(myDiv2.get_attribute("outerHTML"))
#by far the 'best': return a long line of html with all the html tags and name of stations
#method 5: run a loop
#still in the same with function shown above
list = []
for r in range(1,74): #there are 73 list elements
value = driver.find_element(By.XPATH, "/html/body/div[2]/div[2]/div/div/div[4]/div/div[3]/div[1]/div/div[2]/div[2]/div[1]/div/ul/li[1]/select/option["+str(r)+"]").text
list.append(value)
print(list)
#not bad but I still can't get the tag values
#using print() just gives me blank list
2
Answers
You have to wait for the options to load. Most probably the options are being dynamically generated in JS so that’s why beautifulsoup is not working.
You may try with selenium this way: