I am trying to make a web crawler – scraper to get the news.
I want to remove elements that are in a specific class. But, the problem is that this class is nested in another class.
The code is below:
import requests
import numpy as np
from bs4 import BeautifulSoup
url = 'https://www.moneyreview.gr/life-and-arts/86916/mia-apli-lysi-gia-to-rochalito-to-kolpo-poy-sozei-chiliades-gamoys/'
r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
title = soup1.find('h1').get_text()
article = requests.get(url)
article_content = article.content
soup_article = BeautifulSoup(article_content, 'html5lib')
body = soup_article.find_all('div', class_='entry-content')
The unwanted elements
Inside the text of the article there is also the text of a tweet. I want to remove this text and all twitter tags etc from the article text so that I have a clean text.
I wrote this code to print everything inside this twitter tag:
for elements in body:
    quote = soup1.find_all('blockquote', class_="twitter-tweet")
    print(quote)
I get this result:
With the code below I put the paragraphs of the text in a list:
x = body[0].find_all('p')
list_paragraphs = []
for p in np.arange(0, len(x)):
    paragraph = x[p].text.replace("\n", " ")
    list_paragraphs.append(paragraph)
Where the problem is:
I want everything inside the list quote to be removed from the list list_paragraphs.
But everything I have tried so far has failed.
my_list = []
for i in quote:
    if i:
        my_list.append(i.text.strip())
print(my_list)
Attempt 1
l3 = [x for x in list_paragraphs if x not in my_list]
print(l3)
Attempt 2
for element in my_list:
    if element in list_paragraphs:
        list_paragraphs.remove(element)
Can you suggest something to do?
2 Answers
If I understand correctly, you just want all the text inside the <p> tags, unless it is enclosed in a <blockquote> with the class twitter-tweet. If that is the case, I would say the easiest way to accomplish this is to simply get rid of all the offending <blockquote> tags. You can do that with decompose, for instance. Basically, you find all those tags as you already did, using find_all, and then call .decompose() on each one.
I took the liberty of optimizing your code a bit, as I understood it. Here is my suggestion:
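A minimal sketch of the decompose approach, written for illustration rather than taken from the original listing. To keep it runnable without network access it parses a small inline page instead of the real article; SAMPLE_HTML and extract_paragraphs are hypothetical names, the markup is only assumed to resemble the site, and 'html.parser' is swapped in for 'html5lib' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real site markup).
SAMPLE_HTML = """
<html><body>
<h1>Sample title</h1>
<div class="entry-content">
  <p>First paragraph.</p>
  <blockquote class="twitter-tweet">
    <p>Tweet text that should go away.</p>
  </blockquote>
  <p>   </p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


def extract_paragraphs(html):
    """Return (title, paragraphs) with embedded tweets removed."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    # find returns the first matching <div>, so no indexing is needed.
    body = soup.find("div", class_="entry-content")

    # Remove every embedded tweet from the tree before reading the paragraphs.
    for tweet in body.find_all("blockquote", class_="twitter-tweet"):
        tweet.decompose()

    paragraphs = []
    for p in body.find_all("p"):
        text = p.get_text(strip=True)
        if text:  # skip paragraphs that are empty after stripping
            paragraphs.append(text)
    return title, paragraphs


title, list_paragraphs = extract_paragraphs(SAMPLE_HTML)
print(title)
print(list_paragraphs)
```

For the real article, the same function could be called with requests.get(url).content in place of SAMPLE_HTML, assuming the page still uses the entry-content and twitter-tweet classes.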
If the page has only one <div> with the class entry-content, you can use find instead of find_all.
Calling .get_text(strip=True) on each paragraph tag strips leading and trailing whitespace from the text.
Empty strings are presumably not wanted in list_paragraphs, so inside the loop we only append a text if it is not empty.
Hope this helps.
Another approach would be to use extract to remove the Twitter content, as shown below:
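A hedged sketch of the extract variant, under the same assumptions as before: SAMPLE_HTML is a hypothetical inline page standing in for the real article, and 'html.parser' is used only to avoid an extra parser dependency. Unlike decompose, extract returns the removed element, so the tweet text stays available if it is needed later.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real site markup).
SAMPLE_HTML = """
<div class="entry-content">
  <p>Intro paragraph.</p>
  <blockquote class="twitter-tweet">
    <p>Tweet text to drop.</p>
  </blockquote>
  <p>Closing paragraph.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.find("div", class_="entry-content")

# extract() detaches each tweet from the tree and hands the element back.
removed = [
    tweet.extract()
    for tweet in body.find_all("blockquote", class_="twitter-tweet")
]

# Only the non-tweet paragraphs remain under body now.
clean_paragraphs = []
for p in body.find_all("p"):
    text = p.get_text(strip=True)
    if text:
        clean_paragraphs.append(text)

# The detached tweets can still be inspected separately.
tweet_texts = [tweet.get_text(strip=True) for tweet in removed]

print(clean_paragraphs)
print(tweet_texts)
```

If the removed tweets are never used, decompose is the simpler choice; extract is mainly useful when the tweet text should be kept apart from the article text.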