I am new to web scraping.
I am using the rvest package in R to scrape web content and I want to select paragraphs (<p>) that do not contain links (<a>).
So far, I have not been very successful with this approach:
library(rvest)

html <- read_html("https://www.news4teachers.de/2023/08/schaemt-euch-deutschland-steht-vor-den-vereinten-nationen-am-pranger-weil-es-die-inklusion-an-schulen-verweigert/")
html |>
  html_elements("article") |>
  html_elements("p") |>
  html_elements(":not(a)")
3 Answers
You could select all the <p> tags and then filter them in R if they have any <a> tags. For example:

To get all the <p> tags which do not contain any <a> tags, you can use an XPath expression:

Created on 2023-09-21 with reprex v2.0.2
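A minimal sketch of the two suggestions above, not the answerers' exact code. It assumes a small inline HTML snippet as a hypothetical stand-in for the article page, so it runs without network access. The names doc, paragraphs, has_link, no_link_filtered and no_link_xpath are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the real article page (assumption, not the actual site)
doc <- read_html('
  <article>
    <p>Plain paragraph.</p>
    <p>Paragraph with a <a href="https://example.com">link</a>.</p>
    <p>Another <em>plain</em> paragraph.</p>
  </article>')

paragraphs <- html_elements(doc, "article p")

# Suggestion 1: select every <p>, then filter in R.
# For each paragraph, check whether it has at least one <a> descendant.
has_link <- vapply(
  paragraphs,
  function(p) length(html_elements(p, "a")) > 0,
  logical(1)
)
no_link_filtered <- paragraphs[!has_link]

# Suggestion 2: let XPath do the filtering.
# not(.//a) keeps only <p> nodes without any <a> descendant.
no_link_xpath <- html_elements(doc, xpath = "//article//p[not(.//a)]")

html_text2(no_link_filtered)
html_text2(no_link_xpath)
```

Both variants should return the same two paragraphs here. The XPath version is shorter, while filtering in R is easier to extend with further conditions.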
Unfortunately it seems that the pseudo-class :has() is not supported by selectr/cssselect, which are used by rvest for parsing CSS selectors. Otherwise something like this would work:
We can work around this by converting the xml_nodeset to character and using stringr::str_detect() to find which <p> elements have an <a> in them. Then we subset the xml_nodeset to only include those that don't have a link (<a>) in them.
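A rough sketch of that workaround, assuming a small inline HTML snippet as a hypothetical stand-in for the real page. The regular expression is my own choice, not necessarily the one the answer used: it requires whitespace or ">" after "<a" so that tags such as <abbr> are not mistaken for links.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the real article page (assumption, not the actual site)
doc <- read_html('
  <article>
    <p>Plain paragraph.</p>
    <p>Paragraph with a <a href="https://example.com">link</a>.</p>
    <p>Uses an <abbr title="United Nations">UN</abbr> abbreviation.</p>
  </article>')

paragraphs <- html_elements(doc, "article p")

# as.character() serialises each node back to its HTML source string
paragraph_html <- as.character(paragraphs)

# TRUE where the serialised paragraph contains an opening <a> tag;
# "<a[\\s>]" matches "<a " or "<a>" but not "<abbr" or "<article"
has_link <- str_detect(paragraph_html, "<a[\\s>]")

# Keep only the paragraphs without a link
no_link <- paragraphs[!has_link]
html_text2(no_link)
```

String matching on serialised HTML is more fragile than querying the parsed tree, so the XPath variant is probably the safer choice when it is available.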