How can I scrape the subtitle from a JavaScript-driven page with Python?

remotepbm
August 6, 2024

I am trying to scrape this site: https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2

I am trying to get the title 'Friday Night Lights', but I can't seem to get past the JavaScript.

I am using Python with Selenium and BeautifulSoup. I tried WebDriverWait(driver, 10), and I used webdriver.Chrome.

options = webdriver.ChromeOptions()
options.add_argument("disable-infobars")
options.add_argument("start-maximized")
options.add_argument("disable-dev-shm-usage")
options.add_argument("no-sandbox")
#options.add_experimental_option("prefs", {'profile.managed_default_content_settings.javascript': 2})
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
page = requests.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#ViewsPageId-BB1kJC5H')))
title = soup.find_all(name="span", class_="title")
print(title)

This returns an empty list. When I print the page source, I get the HTML from before JavaScript execution, so the title is not there. In the browser inspector, however, I see the completed HTML after JavaScript execution, which includes the title.

Answers

Tricky.

The problem is that the information is inside a shadow DOM element, which can't be accessed directly. You have to do some extra work:

driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')

WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'gallery-slideshow'))
)

# Step 1: Locate the shadow host element
shadow_host = driver.find_element(By.CSS_SELECTOR, 'gallery-slideshow')

# Step 2: Access the shadow root using JavaScript
shadow_root = driver.execute_script('return arguments[0].shadowRoot', shadow_host)

# Step 3: Interact with elements inside the shadow DOM
shadow_element = shadow_root.find_element(By.CLASS_NAME, 'metadata-container')

# The first line of the container's text is the slide title
print(shadow_element.text.split('\n')[0])

That should do the trick.
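
As a variation on the steps above, here is a minimal sketch that wraps the lookup in a function. It assumes Selenium 4, where elements expose a `shadow_root` property for open shadow roots. The helper names `first_line` and `slide_title` are made up for illustration, and the selectors are the ones from the snippet above. It has not been checked against the live page, so treat it as a starting point.

```python
# Same string value as selenium's By.CSS_SELECTOR; spelled out here so the
# helpers below do not need to import selenium themselves.
CSS = "css selector"


def first_line(text):
    """Return the first non-empty line of a block of text, or '' if none."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


def slide_title(driver, host_css="gallery-slideshow",
                inner_css=".metadata-container"):
    """Read the first text line inside the shadow root of `host_css`.

    `driver` is assumed to be a Selenium 4 WebDriver that has already
    loaded the page and waited for the host element to appear.
    """
    # Locate the shadow host in the regular DOM
    host = driver.find_element(CSS, host_css)
    # Selenium 4 property; only works for open shadow roots
    root = host.shadow_root
    # Search inside the shadow DOM
    container = root.find_element(CSS, inner_css)
    return first_line(container.text)
```

Call it as `slide_title(driver)` after the `driver.get(...)` and `WebDriverWait` lines shown earlier.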

- GTK
- August 6, 2024 at 6:40 pm
- 0 votes
You can use the API:
```
import requests

url = "https://assets.msn.com/content/view/v2/Detail/nl-be/BB1kJC5H"
response = requests.get(url)
data = response.json()

print(data["slides"])
```
Friday Night Lights is the title of the second slide: data['slides'][1]['title']
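
If you want every title rather than just the second one, something like the following might work. This is only a sketch: `slide_titles` is a hypothetical helper, and the `sample` dict is a made-up, trimmed-down stand-in for `response.json()`. The only thing taken from the answer above is that each entry in `slides` has a `title` key.

```python
def slide_titles(data):
    """Return the titles of all slides, skipping slides without a title."""
    return [slide["title"] for slide in data.get("slides", []) if slide.get("title")]


# Hypothetical payload with the same shape as the part of the response used above
sample = {
    "slides": [
        {"title": "First slide (placeholder)"},
        {"title": "Friday Night Lights"},
    ]
}

print(slide_titles(sample)[1])  # prints: Friday Night Lights
```

With the real response you would pass `data` from the snippet above instead of `sample`.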