I’m trying to fetch the Book Title and books embeded url link from an url, the html source content of the url looks like below, i have Just taken some little portion out of it to understand.
The when link name is here .. However the little source html portion as follows..
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>Learning Deep Architectures for AI</h2>
<span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
<div class="meta-auth-ttl"></div>
<p>Foundations and Trends(r) in Machine Learning.</p>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row" isbn-data="1496034023">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<span class="meta-auth"><b>Roberto Battiti & Mauro Brunato, 2013</b></span>
<div class="meta-auth-ttl"></div>
<p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
<div>
<a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
I have tried below code:
This code just fetched the Book name or Title but still has header <h2>
printing. I am looking forward to print Book name
and book’s pdf link as well.
#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq
web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = bs(web_res, 'html.parser')
headers = soup.find_all(['h2'])
print(*headers, sep='n')
#divs = soup.find_all('div')
#print(*divs, sep="nn")
header_1 = soup.find_all('h2', class_='book-container')
print(header_1)
output:
<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>
Desired Output:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Please help me understand how to achive this as I have googled around but due to lack of knowlede i’m unable to get it. as when i see the html source there are lot of div
and class
, so little confused to opt which class to fetch the href
and h2
.
2
Answers
You can get the main idea from this code:
The HTML is very nicely structured and you can make use of that here. The site evidently uses Bootstrap as a style scaffolding (with
row
andcol-[size]-[gridcount]
classes you can mostly ignore.You essentially have:
<div class="book">
per book<div class="book-cats">
category and<div class="star-ratings">
ratings block<h2>
book title<span class="meta-auth">
author line<p>
book description<a class=“btn" ...>
Most of those can be ignored. Both the title and your desired link are the first element of their type, so you could just use
element.nested_element
to grab either.So all you have to do is
book
divs.h2
and firsta
elements.h2
href
attribute of thea
anchor link.like this:
I used
.select_one()
with a CSS selector to be a bit more specific about what link element to accept;.btn
specifies the class and[href]
that ahref
attribute must be present.I also enhanced the book search by limiting it to divs that have both a title and at least 1 link; the
:has(...)
selector limits matches to those with specific child elements.The above produces: