
This is the code I have, but it prints the whole paragraph. How do I print only the first sentence, up to the first dot?

from bs4 import BeautifulSoup
import urllib.request,time

article = 'https://www.theguardian.com/science/2012/oct/03/philosophy-artificial-intelligence'

req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html,'lxml')

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

This code prints:

To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial. The brain is the only kind of object
capable of understanding that the cosmos is even there, or why there
are infinitely many prime numbers, or that apples fall because of the
curvature of space-time, or that obeying its own inborn instincts can
be morally wrong, or that it itself exists. Nor are its unique
abilities confined to such cerebral matters. The cold, physical fact
is that it is the only kind of object that can propel itself into
space and back without harm, or predict and prevent a meteor strike on
itself, or cool objects to a billionth of a degree above absolute
zero, or detect others of its kind across galactic distances.

BUT I ONLY want it to print:

To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial.

Thanks for help

4 Answers


  1. Split the text on that dot; for a single split, using str.partition() is faster than str.split() with a limit:

    text = soup.find_all('p')[0].get_text()
    if len(text) > 100:
        text = text.partition('.')[0] + '.'
    print(text)
    

    If you only need to process the first <p> element, use soup.find() instead:

    text = soup.find('p').get_text()
    if len(text) > 100:
        text = text.partition('.')[0] + '.'
    print(text)
    

    For your given URL, however, the sample text is found as the second paragraph:

    >>> soup.find_all('p')[1]
    <p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
    >>> text = soup.find_all('p')[1].get_text()
    >>> text.partition('.')[0] + '.'
    'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
    
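  2. Split the paragraph on '.' and print the first piece. Note that split() drops the separator, so the trailing period is not printed:

    def print_intro():
        paragraph = soup.find_all('p')[0].get_text()
        if len(paragraph) > 100:
            phrase_list = paragraph.split('.')
            print(phrase_list[0])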
    
  3. Split the paragraph at the first period. The second argument, 1, specifies maxsplit and saves you from unnecessary extra splitting.

    def print_intro():
        if len(soup.find_all('p')[0].get_text()) > 100:
            my_paragraph = soup.find_all('p')[0].get_text()
            my_list = my_paragraph.split('.', 1)
            print(my_list[0])
    
  4. You can use find('.'); it returns the index of the first occurrence of what you're looking for.

    So if the paragraph is stored in a variable called paragraph:

    sentence_index = paragraph.find('.')
    # add 1 so the slice includes the '.'
    sentence_index += 1
    print(paragraph[0:sentence_index])
    

    This leaves out the checks, such as whether the string in paragraph contains a '.' at all. Note that find() returns -1 if it does not find the substring you're looking for.
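
The check that the find() approach leaves out could be sketched roughly as below. This is only a minimal sketch: the helper name first_sentence is hypothetical and does not come from any of the answers, and it assumes the first '.' really ends the sentence, so an abbreviation such as "e.g." would cut the text short.

```python
def first_sentence(paragraph):
    """Return the text up to and including the first '.'.

    If the text contains no '.', return it unchanged.
    (first_sentence is a hypothetical helper name, used for illustration.)
    """
    end = paragraph.find('.')
    if end == -1:
        # find() returns -1 when there is no '.', so keep the whole text
        return paragraph
    # end + 1 so the slice includes the '.' itself
    return paragraph[:end + 1]


print(first_sentence("First one. Second one."))
print(first_sentence("No period here"))
```

With the page from the question, this would probably be called as first_sentence(soup.find_all('p')[1].get_text()), since the sample text is the second paragraph there.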
