In this solution to removing HTML tags from a string, the string is passed to rvest::read_html() to create an html_document object, and the object is then passed to rvest::html_text() to return "HTML-less text." However, read_html() throws an error if the string does not contain HTML tags, because the string is then treated as a file/connection path, as demonstrated below. This can be problematic when attempting to remove HTML from many strings that may not contain any tags.
library(rvest)
# Example data
dat <- c(
"<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
"Positives: Better football player than his measureables would indicate. ..."
)
# Success: produces html_document object
rvest::read_html(dat[1])
#> {html_document}
#> <html>
#> [1] <body>\n<b>Positives:</b> Rangy, athletic build with room for additional ...
# Error
rvest::read_html(dat[2])
#> Error in `path_to_connection()`:
#> ! 'Positives: Better football player than his measureables would
#> indicate. ...' does not exist in current working directory
#> ('C:/LONG_PATH_HERE').
Is there a fast way to ensure read_html() treats each string as XML even if it does not contain any tags, or alternatively to remove HTML to the same effect as read_html() |> html_text()?
One idea was to simply append "" or "r" to the end of each string. However, I imagine there is either a more efficient approach that returns the string without any computation when the string lacks any HTML or some way of accomplishing this using the function’s arguments. Other alternatives would involve using regex to remove tags, although doing so violates the "don’t use regex on html" principle.
2 Answers
You may try the charToRaw() function inside the read_html() step, so that the input is parsed as raw content rather than interpreted as a path.
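A minimal sketch of the charToRaw() idea, reusing the example data from the question. The helper name strip_html() is hypothetical, and the sketch assumes plain ASCII/UTF-8 input with no NA or empty strings, which would need separate handling.

```r
library(rvest)

# Example data from the question
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

# strip_html() is a hypothetical helper name, not part of rvest.
# charToRaw() turns the string into a raw vector, which read_html()
# parses as document content instead of trying to open it as a path.
strip_html <- function(x) {
  html_text(read_html(charToRaw(x)))
}

# read_html() works on one document at a time, so apply per element
res <- vapply(dat, strip_html, character(1), USE.NAMES = FALSE)
res
```

Because the raw vector is always parsed as content, strings with and without tags go through the same code path, so no per-string check for tags is needed.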
You can use minimal_html() instead of read_html(), though under the hood it's just read_html() with a few inserted tags.
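A minimal sketch of the minimal_html() route, again on the question's example data. It assumes each string is processed individually, and the trimws() call is only a precaution in case the inserted tags leave stray whitespace around the text.

```r
library(rvest)

# Example data from the question
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

# minimal_html() prepends a doctype, a charset <meta>, and a <title>
# before calling read_html(), so the input always contains tags and is
# never mistaken for a file path.
res <- vapply(
  dat,
  function(x) html_text(minimal_html(x)),
  character(1),
  USE.NAMES = FALSE
)

# Precaution only: drop any leading/trailing whitespace
res <- trimws(res)
res
```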