from this link: https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8
I want to scrape the course codes under the section "Programme content". However, whatever code I run, and however I change it, I get an error message mentioning "NoneType".
Here is an example of the HTML code:
`<div class="sc-fXSgeo BFRgs">
<div class="courseList">
<div class="sc-esYiGF ikRlqb ui-card ui-card--course">
<div class="codeUnitContainer"><div class="code">FM100</div>
<div class="unit">Half unit</div>
</div>
<div class="card__content">
<h4 class="card__title">
<a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm" rel="noopener noreferrer" target="_blank">Introduction to Finance</a>
</h4>
</div>`
Can you please write code that works? I don’t know where the problem is.
course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
2 Answers
This page uses JavaScript to load this section, but `requests` and `BeautifulSoup` can’t run JavaScript, so they can’t find this element. You may need to use Selenium to control a real web browser, which can run JavaScript.
Once you use `selenium` to create a `driver`, you can get the HTML and send it to `BeautifulSoup`.
Or you can use Selenium directly to search for the data.
Minimal working code:
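A minimal sketch of the Selenium-plus-BeautifulSoup approach described above, not a definitive implementation. It assumes Selenium 4 with Chrome installed, and that the rendered page still uses the `courseList` and `code` class names from the HTML in the question. The function names `fetch_rendered_html` and `extract_course_codes` and the 15-second timeout are made up for illustration. Using `select()` instead of chained `find()` calls also avoids the `NoneType` error, because `find()` returns `None` when nothing matches while `select()` returns an empty list.

```python
from bs4 import BeautifulSoup

URL = (
    "https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance"
    "?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8"
)


def fetch_rendered_html(url, wait_seconds=15):
    """Open the page in a real browser and return the HTML after JavaScript has run."""
    # Imported inside the function so the parsing helper works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        # Wait until JavaScript has added the course list to the page.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.courseList"))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_course_codes(html):
    """Return the text of every <div class="code"> inside <div class="courseList">."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns an empty list when nothing matches, never None.
    return [div.get_text(strip=True) for div in soup.select("div.courseList div.code")]


# Usage (starts a browser, so it is left as a comment):
# print(extract_course_codes(fetch_rendered_html(URL)))
```

The selector avoids the generated class names such as `sc-esYiGF ikRlqb`, since names like that often change when a site is rebuilt.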
Result:
It seems `BeautifulSoup` found more elements.
BTW:
You may also try to find the URL used by JavaScript to load the data and read it directly from the server. I found a URL which sends the information as JSON (so it is easier to use), but it needs authorization, and it may need `requests` with `Session` to handle cookies.
So I skipped this idea for now.
You need to use a tool like Selenium to get the dynamic content that is loaded by JavaScript. Using the requests library alone does not get the dynamic HTML you saw when you inspected the page in your browser; it only gets the static HTML.
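Something along these lines should work with Selenium alone. This is a sketch under a few assumptions: Selenium 4 with Chrome installed, and the `courseList` and `code` class names from the question still present on the rendered page. The helper names `clean_codes` and `scrape_course_codes` and the 15-second timeout are made up for illustration, and dropping duplicate codes is my own choice, not something the page requires.

```python
def clean_codes(texts):
    """Strip whitespace, drop empty strings and duplicates, keep page order."""
    seen = set()
    codes = []
    for text in texts:
        code = text.strip()
        if code and code not in seen:
            seen.add(code)
            codes.append(code)
    return codes


def scrape_course_codes(url, wait_seconds=15):
    """Load the page in Chrome and read the course codes straight from the DOM."""
    # Imported inside the function so clean_codes() can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        # Wait for the JavaScript-rendered code elements instead of searching at once.
        elements = WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "div.courseList div.code")
            )
        )
        # .text holds only the visible text, so hidden elements give "".
        return clean_codes(element.text for element in elements)
    finally:
        driver.quit()


# Usage (starts a browser, so it is left as a comment):
# print(scrape_course_codes("https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8"))
```

The explicit wait matters here: searching right after `driver.get()` can run before the course list exists, which gives an empty result for the same reason `requests` does.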
Output: