I have to scrape text from a div that contains all the article text, but the div's class name is not unique, so I tried using a CSS selector. It returns an empty list.
import requests
from bs4 import BeautifulSoup

def get_page_links(url):
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'lxml')
    links = sp.select('div.tdb-block-inner td-fix-index')
    print(links)

get_page_links(
    'https://insights.blackcoffer.com/ai-in-healthcare-to-improve-patient-outcomes/')
3 Answers
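Try this CSS selector: .tdb_single_content .tdb-block-inner.td-fix-index
In the selector from the question, the space before td-fix-index turns it into a descendant tag name rather than a second class, so nothing matches. Joining the two class names with a dot selects a single element that carries both classes.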
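A minimal sketch comparing the two selectors. It runs against a small made-up HTML fragment rather than the live page, so the markup is an assumption that only mirrors the class names mentioned above. It uses `html.parser` so that it runs without lxml installed; passing `'lxml'` as in the question should behave the same here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only mimics the class names from the question,
# it is not copied from the real page.
SAMPLE_HTML = """
<div class="tdb-block-inner td-fix-index"><p>Sidebar teaser</p></div>
<div class="tdb_single_content">
  <div class="tdb-block-inner td-fix-index">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector from the question: searches for a <td-fix-index> tag nested
# inside div.tdb-block-inner, which does not exist.
broken = soup.select("div.tdb-block-inner td-fix-index")

# Suggested selector: one element carrying both classes, scoped to the
# article content block so the sidebar div is skipped.
fixed = soup.select(".tdb_single_content .tdb-block-inner.td-fix-index")

article_text = fixed[0].get_text(separator="\n", strip=True)
print(broken)
print(article_text)
```

For the real page, the same `select` call would be applied to `BeautifulSoup(requests.get(url).text, 'lxml')`, assuming the live markup still uses these class names.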
It is better to select the td-post-content class rather than tdb-block-inner, because on other pages the element with the tdb-block-inner class may be missing, for instance on https://insights.blackcoffer.com/what-is-the-future-of-mobile-apps/. The article text can then be scraped from the td-post-content element.
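One way this could look, as a sketch rather than the answer's original code. The helper name `extract_article_text` and the sample markup are assumptions made for illustration; only the `td-post-content` class comes from the answer.

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the text of the first .td-post-content element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(".td-post-content")
    if content is None:
        # The page does not use this class; let the caller decide what to do.
        return None
    return content.get_text(separator="\n", strip=True)


# Hypothetical markup without any tdb-block-inner wrapper.
sample = '<div class="td-post-content"><p>Intro.</p><p>Body.</p></div>'
print(extract_article_text(sample))
```

To use it on a live page, pass the downloaded HTML, for example `extract_article_text(requests.get(url).text)`.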
There are a number of ways to achieve this, so I would like to add another one that does not require any classes at all and uses only the HTML structure. In general, I would recommend working with the structure if the pattern allows it.
The first approach is to select the first <p> in the <article> and then get the text from its parent element. Alternatively, select all <p> elements in the <article>, then extract and join the text of each.
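Both structure-based approaches can be sketched as follows. The HTML fragment is a made-up stand-in for the real page, and it assumes the article paragraphs share one parent inside `<article>`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real article page.
SAMPLE_HTML = """
<article>
  <h1>Title</h1>
  <div>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Approach 1: take the first <p> inside <article>, then read all text
# from its parent element.
first_p = soup.select_one("article p")
text_from_parent = first_p.parent.get_text(separator="\n", strip=True)

# Approach 2: collect every <p> inside <article> and join their texts.
paragraphs = soup.select("article p")
text_from_paragraphs = "\n".join(p.get_text(strip=True) for p in paragraphs)

print(text_from_parent)
print(text_from_paragraphs)
```

The two results match here, but they can differ on real pages: the parent-based version also picks up non-paragraph text such as lists or subheadings inside the same container, while the second version keeps only `<p>` content.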