I am trying to scrape this site: https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2
I am trying to get the title 'Friday Night Lights', but I don't seem to get past the JavaScript.
I am using Python with Selenium and BeautifulSoup.
I tried WebDriverWait(driver, 10), and I used webdriver.Chrome:
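import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("disable-infobars")
options.add_argument("start-maximized")
options.add_argument("disable-dev-shm-usage")
options.add_argument("no-sandbox")
#options.add_experimental_option("prefs", {'profile.managed_default_content_settings.javascript': 2})
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
# Plain HTTP fetch of the same URL: this only ever sees the HTML before JavaScript runs
page = requests.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
# Wait until the slideshow container is present in the DOM
element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#ViewsPageId-BB1kJC5H')))
# Assumed: the post never shows how soup was built; parsing the Selenium page source is a guess
soup = BeautifulSoup(driver.page_source, "html.parser")
title = soup.find_all(name="span", class_="title")
print(title)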
This returns an empty list. When I print the page source, I get the HTML from before JavaScript execution, so the title is missing. In the inspector, however, I see the completed HTML after JavaScript execution, and that does include the title.
2 Answers
Tricky. The problem is that the information sits inside a shadow DOM element, which can't be accessed directly through the page source or with ordinary selectors. You have to do some extra work:
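A minimal sketch of that extra work, not the answer's original code: it builds a small JavaScript query that steps through each shadow host's shadowRoot and lets Selenium run it. The helper names build_shadow_query and find_in_shadow are hypothetical, and the selectors in the usage comment are placeholders that have to be read from the inspector. It also assumes the shadow roots are open (closed roots expose no shadowRoot).

```python
import json


def build_shadow_query(selectors):
    """Build a JavaScript snippet that pierces nested (open) shadow roots.

    Every selector except the last names a shadow host; the last selector
    is looked up inside the innermost shadow root. Optional chaining makes
    the script evaluate to undefined instead of throwing when a host is
    not there yet.
    """
    if not selectors:
        raise ValueError("need at least one selector")
    first, *rest = selectors
    # json.dumps yields a properly quoted and escaped JS string literal
    expr = f"document.querySelector({json.dumps(first)})"
    for sel in rest:
        expr += f"?.shadowRoot?.querySelector({json.dumps(sel)})"
    return f"return {expr};"


def find_in_shadow(driver, selectors, timeout=30):
    """Poll until the element behind the shadow roots exists, then return it."""
    # Imported here so the query builder can be used without Selenium installed
    from selenium.webdriver.support.ui import WebDriverWait

    script = build_shadow_query(selectors)
    # until() retries while the script returns a falsy value (None)
    return WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script(script)
    )


# Usage (placeholder selectors, NOT the real MSN structure):
# element = find_in_shadow(driver, ["outer-host", "inner-host", "span.title"])
# print(element.text)
```

Recent Selenium 4 releases also expose a shadow_root property on elements, which may be an alternative to running JavaScript, but the chained query keeps the whole lookup in one retryable step.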
That should do the trick.
You can use the API:
'Friday Night Lights' is the title of the second slide: data['slides'][1]['title']
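A rough sketch of how that lookup could be wired up, with the caveat that the API URL is not shown in this answer. Here api_url is a placeholder you would copy from the browser's network tab, and fetch_slideshow and slide_titles are hypothetical helper names. The only thing taken from the answer is the response shape: a 'slides' list whose items carry a 'title'.

```python
import json
from urllib.request import Request, urlopen


def fetch_slideshow(api_url, timeout=10):
    """Download and decode the JSON document behind the slideshow.

    api_url is an assumption: it must be the JSON endpoint the page
    itself requests, as seen in the browser's network tab.
    """
    # Some servers reject requests without a browser-like User-Agent
    req = Request(api_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def slide_titles(data):
    """Return the title of every slide, in order; empty if there are none."""
    return [slide.get("title", "") for slide in data.get("slides", [])]


# Usage (api_url is a placeholder to fill in yourself):
# data = fetch_slideshow(api_url)
# print(slide_titles(data)[1])   # second slide, i.e. data['slides'][1]['title']
```

Going through the JSON avoids the browser entirely, so there is no JavaScript rendering or shadow DOM to deal with.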