
From this link: https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8

I want to scrape the course codes under the section "programme content". However, every version of the code I run, no matter how I change it, fails with an error saying the result is "NoneType".

Here is an example of the HTML code:

`<div class="sc-fXSgeo BFRgs">
  <div class="courseList">
    <div class="sc-esYiGF ikRlqb ui-card ui-card--course">
      <div class="codeUnitContainer"><div class="code">FM100</div>
      <div class="unit">Half unit</div>
    </div>
    <div class="card__content">
      <h4 class="card__title">
<a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm" rel="noopener noreferrer" target="_blank">Introduction to Finance</a>
</h4>
</div>`

Can you write code that works, please? I don't know where the problem is.

course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
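
For context, soup comes from fetching the page with requests, roughly like this:

import requests
from bs4 import BeautifulSoup

url = "https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")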

2 Answers


  1. This page uses JavaScript to load that section, but requests and BeautifulSoup can't run JavaScript, so they never see the element. You may need Selenium to control a real web browser, which can run JavaScript.

    Once you use Selenium to create the driver, you can get the HTML and pass it to BeautifulSoup:

    soup = BeautifulSoup(driver.page_source, 'html5lib')
    

    Or you can use Selenium directly to search for the data:

    course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
    course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    

    Minimal working code:

    #!/usr/bin/env python3
    
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    #from selenium.webdriver.common.keys import Keys
    #from selenium.webdriver.support.ui import WebDriverWait
    #from selenium.webdriver.support import expected_conditions as EC
    #from selenium.common.exceptions import NoSuchElementException, TimeoutException
    
    #from webdriver_manager.chrome import ChromeDriverManager
    from webdriver_manager.firefox import GeckoDriverManager
    
    import time
    
    #import undetected_chromedriver as uc
    
    # ---
    
    import selenium
    print('Selenium:', selenium.__version__)
    
    # ---
    
    url = 'https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance'
    
    #driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
    
    #driver = uc.Chrome(executable_path='/home/furas/bin/chromedriver', service_args=['--quiet'])
    #driver = uc.Chrome()
    
    #driver.maximize_window()
    
    driver.get(url)
    #driver.get("data:text/html;charset=utf-8," + html)
    
    # ---
    
    time.sleep(5)
    
    #text_box.send_keys(Keys.ARROW_DOWN)
    
    #wait = WebDriverWait(driver, 10)
    #all_items = wait.until(EC.visibility_of_element_located((By.XPATH, "//a")))
    #all_items = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//a")))
    
    # ---
     
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    
    course_list = soup.find("div", attrs={'class': "courseList"})
    course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
    
    print('--- code ---', len(course_code))
    for item in course_code:
        print(item.get_text(strip=True, separator='\n').split('\n'))
    
    # ---
    
    course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
    #course_code = course_list.find_elements(By.XPATH, '//div[@class="sc-esYiGF.ikRlqb.ui-card.ui-card--course"]')
    course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    #course_code = driver.find_elements(By.CSS_SELECTOR, 'div.courseList div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    print('--- code ---', len(course_code))
    for item in course_code:
        print(item.text.split('\n'))
    

    Result:

    It seems BeautifulSoup found more elements than Selenium did:

    --- code --- 9
    ['FM100', 'Half unit', 'Introduction to Finance']
    ['EC1A3', 'Half unit', 'Microeconomics I']
    ['EC1B3', 'Half unit', 'Macroeconomics I']
    ['ST102', 'One unit', 'Elementary Statistical Theory']
    ['FM102', 'Half unit', 'Quantitative Methods for Finance']
    ['MA108', 'Half unit', 'Methods in calculus and linear algebra']
    ['LSE100', 'Half unit', 'The LSE Course']
    ['AC102', 'Half unit', 'Elements of Financial Accounting']
    ['ST101', 'Half unit', 'Programming for Data Science']
    --- code --- 7
    ['FM100', 'Half unit', 'Introduction to Finance']
    ['EC1A3', 'Half unit', 'Microeconomics I']
    ['EC1B3', 'Half unit', 'Macroeconomics I']
    ['ST102', 'One unit', 'Elementary Statistical Theory']
    ['FM102', 'Half unit', 'Quantitative Methods for Finance']
    ['MA108', 'Half unit', 'Methods in calculus and linear algebra']
    ['LSE100', 'Half unit', 'The LSE Course']
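
    As a side note, the commented-out WebDriverWait imports in the script above point to a more robust option than time.sleep(5): wait explicitly until the course list has been rendered. A short sketch of that variant (it reuses the driver from the script above and the same selectors):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 seconds for JavaScript to render the course list,
    # instead of sleeping a fixed 5 seconds.
    wait = WebDriverWait(driver, 10)
    course_list = wait.until(
        EC.visibility_of_element_located((By.XPATH, '//div[@class="courseList"]'))
    )
    course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
    print('--- code ---', len(course_code))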
    

    BTW:

    You may also try to find the URL that the JavaScript uses to load the data and read it directly from the server. I found a URL which sends the information as JSON (so it is easier to use), but it needs authorization and may require requests with a Session to handle cookies.

    So I skipped this idea for now.
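
    If you want to explore that route, the general pattern is sketched below. The API URL here is only a hypothetical placeholder (the real one would come from the browser's network tab), and the request may still fail without the right authorization.

    import requests

    # Hypothetical sketch: call the JSON endpoint directly, reusing cookies
    # collected while loading the page. 'api_url' is a PLACEHOLDER, not the real endpoint.
    page_url = 'https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance'
    api_url = 'https://www.lse.ac.uk/hypothetical-json-endpoint'  # placeholder

    with requests.Session() as session:
        session.get(page_url)            # let the session collect any cookies the site sets
        response = session.get(api_url)  # may still need extra auth headers
        response.raise_for_status()
        print(response.json())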

  2. You need a tool like Selenium to get the dynamic content that is loaded by JavaScript. The requests library alone only returns the static HTML, not the dynamic content you saw when you inspected the page in your browser.

    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    driver = webdriver.Chrome()
    url='https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8'
    driver.get(url)
    time.sleep(5)  # give JavaScript time to render the course list

    html_content = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_content, 'html.parser')
    course_list = soup.find("div", attrs={'class': "courseList"})
    course_code = course_list.find_all("div", attrs={'class': "code"})
    course_code_list=[i.text for i in course_code]
    print(course_code_list)
    

    Output:

    ['FM100', 'EC1A3', 'EC1B3', 'ST102', 'FM102', 'MA108', 'LSE100', 'AC102', 'ST101']
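
    If you also want the course titles next to the codes, the same soup object can be reused. A small sketch, assuming every course card follows the card__title markup shown in the question:

    course_cards = soup.find_all("div", attrs={'class': "ui-card--course"})
    for card in course_cards:
        code = card.find("div", attrs={'class': "code"})
        title = card.find("h4", attrs={'class': "card__title"})
        # some cards might lack a title, so guard against None
        print(code.get_text(strip=True), title.get_text(strip=True) if title else '')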
    