This is the code I have, but it prints the whole paragraph. How to print the first sentence only, up to the first dot?
from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/
oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
print(soup.find_all('p')[0].get_text())
This code prints:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial. The brain is the only kind of object
capable of understanding that the cosmos is even there, or why there
are infinitely many prime numbers, or that apples fall because of the
curvature of space-time, or that obeying its own inborn instincts can
be morally wrong, or that it itself exists. Nor are its unique
abilities confined to such cerebral matters. The cold, physical fact
is that it is the only kind of object that can propel itself into
space and back without harm, or predict and prevent a meteor strike on
itself, or cool objects to a billionth of a degree above absolute
zero, or detect others of its kind across galactic distances.
BUT I ONLY want it to print:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial.
Thanks for help
4
Answers
Split the text on that dot; for a single split, using
str.partition()
is faster thanstr.split()
with a limit:If you only need to process the first
<p>
element, usesoup.find()
instead:For your given URL, however, the sample text is found as the second paragraph:
split
the paragraph at the firstperiod
. Argument1
species theMAXSPLIT
and saves your time from unneccessary extra splitting.you can use
find('.')
, it return the index of the first occurence of what you’re looking for.So if the paragraph is stored in a variable called
paragraph
Obviously here is missing the control part like check if the string contained in
paragraph
variable has ‘.’ etc.. anyway find() return -1 if it does not find the substring you’re looking for.