
I am trying to scrape this site: https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2

I am trying to get the title 'Friday Night Lights', but I can't seem to get past the JavaScript.

I am using Python with Selenium (webdriver.Chrome) and BeautifulSoup, and I have already tried WebDriverWait(driver, 10).

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    URL = 'https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2'

    options = webdriver.ChromeOptions()
    options.add_argument("--disable-infobars")
    options.add_argument("--start-maximized")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--no-sandbox")
    #options.add_experimental_option("prefs", {'profile.managed_default_content_settings.javascript': 2})
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("--disable-blink-features=AutomationControlled")

    driver = webdriver.Chrome(options=options)
    driver.get(URL)
    page = requests.get(URL)

    # Wait until the article container is present in the DOM
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#ViewsPageId-BB1kJC5H')))

    # Parse the HTML fetched with requests and look for the title spans
    soup = BeautifulSoup(page.text, 'html.parser')
    title = soup.find_all(name="span", class_="title")
    print(title)

This returns an empty list. When I print the page source, I get the HTML from before JavaScript execution, so the title is not there. In the browser inspector, however, I see the completed HTML after JavaScript execution, and that does include the title.

2 Answers


  1. This one is tricky.

    The problem is that the information sits inside a shadow DOM element, so ordinary lookups from the document root do not reach it. You have to do some extra work:

    driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
    
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'gallery-slideshow'))
    )
    
    # Step 1: Locate the shadow host element
    shadow_host = driver.find_element(By.CSS_SELECTOR, 'gallery-slideshow')
    
    # Step 2: Access the shadow root using JavaScript
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', shadow_host)
    
    # Step 3: Interact with elements inside the shadow DOM
    shadow_element = shadow_root.find_element(By.CLASS_NAME, 'metadata-container')
    
    # The title is the first line of the container's text
    print(shadow_element.text.split('\n')[0])
    

    That should do the trick.

  2. You can use the API:

    import requests
    
    url = "https://assets.msn.com/content/view/v2/Detail/nl-be/BB1kJC5H"
    response = requests.get(url)
    data = response.json()
    
    print(data["slides"])
    

    Friday Night Lights is the title of the second slide: data['slides'][1]['title']
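
To list every title rather than only the second one, a small helper over the parsed JSON could look like the sketch below. It assumes only what the answer shows, namely a top-level "slides" list whose items carry a "title" key. The function name slide_titles and the sample dictionary are made up for illustration and are not part of any MSN API.

```python
def slide_titles(data):
    """Return the title of every slide, skipping slides without a usable title."""
    titles = []
    for slide in data.get("slides", []):
        title = slide.get("title")
        # Ignore slides whose title is missing, empty, or not a string
        if isinstance(title, str) and title.strip():
            titles.append(title.strip())
    return titles


# Hypothetical sample shaped like the response described above (not real API output)
sample = {
    "slides": [
        {"title": "First slide"},
        {"title": "Friday Night Lights"},
        {},  # a slide without a title is skipped
    ]
}

print(slide_titles(sample)[1])  # Friday Night Lights
```

With the request from the answer, slide_titles(response.json()) would return the titles in slide order, provided the endpoint keeps that shape.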
