
From this link: https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8

I want to scrape the course codes under the section "programme content". However, every version of the code I run, no matter how I change it, fails with an error saying the result is "NoneType".

Here is an example of the HTML code:

`<div class="sc-fXSgeo BFRgs">
  <div class="courseList">
    <div class="sc-esYiGF ikRlqb ui-card ui-card--course">
      <div class="codeUnitContainer"><div class="code">FM100</div>
      <div class="unit">Half unit</div>
    </div>
    <div class="card__content">
      <h4 class="card__title">
<a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm" rel="noopener noreferrer" target="_blank">Introduction to Finance</a>
</h4>
</div>`

Can you write code that works, please? I don't know where the problem is.

course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
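
For context, soup comes from fetching the page with requests, roughly like this:

import requests
from bs4 import BeautifulSoup

url = "https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")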

2 Answers


  1. This page uses JavaScript to load that section, but requests and BeautifulSoup can't run JavaScript, so they never see the element. You may need Selenium to control a real web browser, which can run JavaScript.

    Once you use Selenium to create the driver, you can get the HTML and pass it to BeautifulSoup:

    soup = BeautifulSoup(driver.page_source, 'html5lib')
    

    Or you can use Selenium directly to search for the data:

    course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
    course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    

    Minimal working code:

    #!/usr/bin/env python3
    
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    #from selenium.webdriver.common.keys import Keys
    #from selenium.webdriver.support.ui import WebDriverWait
    #from selenium.webdriver.support import expected_conditions as EC
    #from selenium.common.exceptions import NoSuchElementException, TimeoutException
    
    #from webdriver_manager.chrome import ChromeDriverManager
    from webdriver_manager.firefox import GeckoDriverManager
    
    import time
    
    #import undetected_chromedriver as uc
    
    # ---
    
    import selenium
    print('Selenium:', selenium.__version__)
    
    # ---
    
    url = 'https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance'
    
    #driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
    
    #driver = uc.Chrome(executable_path='/home/furas/bin/chromedriver', service_args=['--quiet'])
    #driver = uc.Chrome()
    
    #driver.maximize_window()
    
    driver.get(url)
    #driver.get("data:text/html;charset=utf-8," + html)
    
    # ---
    
    time.sleep(5)
    
    #text_box.send_keys(Keys.ARROW_DOWN)
    
    #wait = WebDriverWait(driver, 10)
    #all_items = wait.until(EC.visibility_of_element_located((By.XPATH, "//a")))
    #all_items = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//a")))
    
    # ---
     
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    
    course_list = soup.find("div", attrs={'class': "courseList"})
    course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
    
    print('--- code ---', len(course_code))
    for item in course_code:
        print(item.get_text(strip=True, separator='\n').split('\n'))
    
    # ---
    
    course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
    #course_code = course_list.find_elements(By.XPATH, '//div[@class="sc-esYiGF.ikRlqb.ui-card.ui-card--course"]')
    course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    #course_code = driver.find_elements(By.CSS_SELECTOR, 'div.courseList div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    print('--- code ---', len(course_code))
    for item in course_code:
        print(item.text.split('\n'))
    

    Result:

    It seems BeautifulSoup found more elements than Selenium did:

    --- code --- 9
    ['FM100', 'Half unit', 'Introduction to Finance']
    ['EC1A3', 'Half unit', 'Microeconomics I']
    ['EC1B3', 'Half unit', 'Macroeconomics I']
    ['ST102', 'One unit', 'Elementary Statistical Theory']
    ['FM102', 'Half unit', 'Quantitative Methods for Finance']
    ['MA108', 'Half unit', 'Methods in calculus and linear algebra']
    ['LSE100', 'Half unit', 'The LSE Course']
    ['AC102', 'Half unit', 'Elements of Financial Accounting']
    ['ST101', 'Half unit', 'Programming for Data Science']
    --- code --- 7
    ['FM100', 'Half unit', 'Introduction to Finance']
    ['EC1A3', 'Half unit', 'Microeconomics I']
    ['EC1B3', 'Half unit', 'Macroeconomics I']
    ['ST102', 'One unit', 'Elementary Statistical Theory']
    ['FM102', 'Half unit', 'Quantitative Methods for Finance']
    ['MA108', 'Half unit', 'Methods in calculus and linear algebra']
    ['LSE100', 'Half unit', 'The LSE Course']
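
    As a side note, the commented-out WebDriverWait imports in the script above point to a more robust option than time.sleep(5): wait explicitly until the course list has been rendered. A short sketch of that variant (it reuses the driver from the script above and the same selectors):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 seconds for JavaScript to render the course list,
    # instead of sleeping a fixed 5 seconds.
    wait = WebDriverWait(driver, 10)
    course_list = wait.until(
        EC.visibility_of_element_located((By.XPATH, '//div[@class="courseList"]'))
    )
    course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    print('--- code ---', len(course_code))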
    

    BTW:

    You may also try to find the URL that the JavaScript uses to load the data and read it directly from the server. I found a URL which sends the information as JSON (so it is easier to use), but it needs authorization and may require requests with a Session to handle cookies.

    So I skipped this idea for now.
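
    If you want to explore that route, the general pattern is sketched below. The API URL here is only a hypothetical placeholder (the real one would come from the browser's network tab), and the request may still fail without the right authorization.

    import requests

    # Hypothetical sketch: call the JSON endpoint directly, reusing cookies
    # collected while loading the page. 'api_url' is a PLACEHOLDER, not the real endpoint.
    page_url = 'https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance'
    api_url = 'https://www.lse.ac.uk/hypothetical-json-endpoint'  # placeholder

    with requests.Session() as session:
        session.get(page_url)            # let the session collect any cookies the site sets
        response = session.get(api_url)  # may still need extra auth headers
        response.raise_for_status()
        print(response.json())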

  2. You need a tool like Selenium to get the dynamic content that is loaded by JavaScript. The requests library alone only returns the static HTML, not the dynamic content you saw when you inspected the page in your browser.

    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    driver = webdriver.Chrome()
    url='https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8'
    driver.get(url)
    time.sleep(5)  # give JavaScript time to render the course list

    html_content = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_content, 'html.parser')
    course_list = soup.find("div", attrs={'class': "courseList"})
    course_code = course_list.find_all("div", attrs={'class': "code"})
    course_code_list=[i.text for i in course_code]
    print(course_code_list)
    

    Output:

    ['FM100', 'EC1A3', 'EC1B3', 'ST102', 'FM102', 'MA108', 'LSE100', 'AC102', 'ST101']
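
    If you also want the course titles next to the codes, the same soup object can be reused. A small sketch, assuming every course card follows the card__title markup shown in the question:

    course_cards = soup.find_all("div", attrs={'class': "ui-card--course"})
    for card in course_cards:
        code = card.find("div", attrs={'class': "code"})
        title = card.find("h4", attrs={'class': "card__title"})
        # some cards might lack a title, so guard against None
        print(code.get_text(strip=True), title.get_text(strip=True) if title else '')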
    