
I am trying to build a web crawler/scraper to collect news articles.
I want to remove elements that have a specific class. The problem is that this class is nested inside another class.
My code is below:

import requests
from bs4 import BeautifulSoup

url = 'https://www.moneyreview.gr/life-and-arts/86916/mia-apli-lysi-gia-to-rochalito-to-kolpo-poy-sozei-chiliades-gamoys/'

r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
title = soup1.find('h1').get_text()
article = requests.get(url)
article_content = article.content

soup_article = BeautifulSoup(article_content, 'html5lib')
body = soup_article.find_all('div', class_='entry-content')

The unwanted elements
The article text also contains the text of an embedded tweet. I want to remove that text, along with all the Twitter tags, so that I am left with clean article text.
I wrote this code to print everything inside this twitter tag:

for elements in body:
   quote = soup1.find_all('blockquote', class_= "twitter-tweet")
   print(quote)

This prints the list of matching <blockquote> elements.

With the code below I put the paragraphs of the text in a list:

x = body[0].find_all('p')
list_paragraphs = []

for p in range(len(x)):
    paragraph = x[p].text.replace("\n", " ")
    list_paragraphs.append(paragraph)

Where the problem is:
I want everything in the list quote to be removed from the list list_paragraphs.
But everything I have tried so far has failed.

my_list = []

for i in quote:
   if i:
       my_list.append(i.text.strip())
print(my_list)

Attempt 1

l3 = [x for x in list_paragraphs if x not in my_list]
print(l3)

Attempt 2

for element in my_list:
    if element in list_paragraphs:
        list_paragraphs.remove(element)

Can you suggest a way to do this?

2 Answers


  1. If I understand correctly, you want all the text inside the <p> tags, except text that is enclosed in a <blockquote> with the class twitter-tweet. If that is the case, the easiest way to accomplish this is to remove the offending <blockquote> tags from the tree before collecting the paragraphs. You can do that with decompose, for instance: find those tags with find_all, as you already did, and then call .decompose() on each one.
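
    As for why both attempts in the question probably left the list unchanged: `x not in my_list` and `list.remove` compare whole strings for equality, and the text of a tweet's <blockquote> is normally longer than the text of the <p> nested inside it. The sketch below shows this on made-up markup (the HTML, the handle, and the date are assumptions for illustration, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating an article with one embedded tweet.
html = """
<div class="entry-content">
<p>First paragraph.</p>
<blockquote class="twitter-tweet"><p>Tweet text</p> Someone (@someone) <a href="#">May 1, 2022</a></blockquote>
<p>Last paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", class_="entry-content")

# find_all('p') also returns the <p> nested inside the blockquote.
paragraphs = [p.get_text() for p in body.find_all("p")]

# The blockquote text is the nested paragraph plus the author and date.
quotes = [q.get_text().strip()
          for q in body.find_all("blockquote", class_="twitter-tweet")]

# Equality-based filtering finds no match, so nothing is removed.
filtered = [p for p in paragraphs if p not in quotes]
```

    Here `filtered` still contains "Tweet text", because no paragraph string is exactly equal to the full blockquote string. Removing the tags from the tree avoids the string comparison altogether.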

    I took the liberty of optimizing your code a bit, as I understood it. Here is my suggestion:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = ...
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')
    for quote in soup.find_all('blockquote', class_="twitter-tweet"):
        quote.decompose()
    content_div = soup.find('div', class_='entry-content')
    list_paragraphs = []
    for p in content_div.find_all('p'):
        text = p.get_text(strip=True).replace("\n", " ")
        if text:
            list_paragraphs.append(text)
    
    print(list_paragraphs)
    
    1. In your original code you made the same request to that URL twice for some reason, so I reduced it to a single request.
    2. When you know that you always want the first <div> with the class entry-content, you can use find instead of find_all.
    3. Calling .get_text(strip=True) on each paragraph tag strips leading and trailing whitespace from each piece of text in the paragraph.
    4. I thought it would make little sense to keep empty strings in your list_paragraphs, so inside the loop we only append a text if it is not empty.
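
    One detail of get_text(strip=True) may matter for paragraphs that contain links or other inline tags: each text fragment is stripped separately and the fragments are then joined with the separator, which defaults to an empty string. A small sketch on a made-up paragraph (the HTML is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph containing an inline link.
html = '<p>  Read <a href="#">this link</a> now.\n</p>'
p = BeautifulSoup(html, "html.parser").p

# Fragments are stripped, then joined with the default separator "".
joined_tight = p.get_text(strip=True)

# Passing a separator keeps the words apart.
joined_spaced = p.get_text(" ", strip=True)
```

    With this input, `joined_tight` is "Readthis linknow." and `joined_spaced` is "Read this link now.", so passing " " as the separator may be the safer choice if the article's paragraphs contain inline tags.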

    Hope this helps.

  2. Another approach is to use extract to remove the Twitter content, as shown below:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'your_url'
    unwanted_tags = ['blockquote']
    
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # remove tweets from content
    for tag in unwanted_tags:
        for element in soup(tag):
            element.extract()
    
    # collect paragraph texts, skipping empty strings and bare newlines
    main_content = [text for text in (p.get_text() for p in soup.find_all('p')) if text not in ['', '\n']]
    print(''.join(main_content))
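
    Note that soup('blockquote') matches every <blockquote> on the page, not only embedded tweets. The sketch below, on made-up markup (an assumption for illustration), restricts the removal to the twitter-tweet class and also shows that extract(), unlike decompose(), returns the removed tag so it can still be inspected:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one ordinary quotation and one embedded tweet.
html = """
<div class="entry-content">
<p>Intro.</p>
<blockquote>An ordinary quotation.</blockquote>
<blockquote class="twitter-tweet"><p>Tweet text</p></blockquote>
<p>Outro.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# extract() detaches each tag from the tree and returns it.
removed = [tag.extract()
           for tag in soup.find_all("blockquote", class_="twitter-tweet")]
removed_texts = [tag.get_text(strip=True) for tag in removed]

# Only the article's own paragraphs remain in the tree.
remaining = [p.get_text(strip=True) for p in soup.find_all("p")]

# The ordinary quotation was not touched.
kept_quotes = [q.get_text(strip=True) for q in soup.find_all("blockquote")]
```

    If the removed tweets are never needed again, decompose() does the same job and discards them.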
    