R: Alternatives/approaches to read_html() + html_text() that also work on strings without HTML/XML tags

socialscientist
May 18, 2024

In this solution for removing HTML tags from a string, the string is passed to rvest::read_html() to create an html_document object, which is then passed to rvest::html_text() to return "HTML-less text."

However, read_html() throws an error if the string does not contain HTML tags because the string is treated as a file/connection path, as demonstrated below. This can be problematic when attempting to remove HTML from many strings that may not contain any tags.

library(rvest)

# Example data
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)


# Success: produces html_document object
rvest::read_html(dat[1])
#> {html_document}
#> <html>
#> [1] <body>\n<b>Positives:</b> Rangy, athletic build with room for additional  ...


# Error
rvest::read_html(dat[2])
#> Error in `path_to_connection()`:
#> ! 'Positives: Better football player than his measureables would
#>   indicate. ...' does not exist in current working directory
#>   ('C:/LONG_PATH_HERE').

Is there a fast way to ensure read_html() treats each string as XML even if it does not contain any tags, or, alternatively, a way to remove HTML to the same effect as read_html() |> html_text()?

One idea was to simply append "" or "r" to the end of each string. However, I imagine there is either a more efficient approach that returns the string without any computation when the string lacks any HTML or some way of accomplishing this using the function’s arguments. Other alternatives would involve using regex to remove tags, although doing so violates the "don’t use regex on html" principle.

Tags: data-wrangling html r rvest xml

Answers

You can wrap each string in charToRaw() before passing it to read_html(), so that it is parsed as content rather than treated as a path:

### Packages
library(rvest)
library(purrr)

### Data
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

### Convert each string to raw bytes so read_html() parses it as content, then extract the text
clean <- function(x) {
  read_html(charToRaw(x)) %>% html_text()
}

### Map the function over the character vector
map_chr(dat, clean, .progress = TRUE)

Output:

[1] "Positives: Rangy, athletic build with room for additional growth. ..."      
[2] "Positives: Better football player than his measureables would indicate. ..."
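
If most of your strings contain no markup, one possible refinement is to skip the parser for them entirely. Below is a minimal sketch, assuming that a literal "<" is a good enough signal that a string may contain tags; the helper name strip_html is hypothetical and not part of rvest. Strings without "<" are returned untouched, so HTML entities such as &amp; in tag-free strings would not be decoded, unlike with read_html() |> html_text().

```r
library(rvest)

dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

# Hypothetical helper (illustrative name): only parse strings that contain "<".
strip_html <- function(x) {
  # Assumption: a string without "<" has no tags and can be returned as-is
  has_tag <- grepl("<", x, fixed = TRUE)
  x[has_tag] <- vapply(
    x[has_tag],
    # charToRaw() makes read_html() treat the input as content, not a path
    function(s) html_text(read_html(charToRaw(s))),
    character(1),
    USE.NAMES = FALSE
  )
  x
}

strip_html(dat)
```

Whether this is actually faster depends on how many of your strings contain "<", so it is worth timing on your own data.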

You can use minimal_html() instead of read_html():

rvest::minimal_html("Positives: Better football player than his measureables would indicate. ...")
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body><p>\nPositives: Better football player than his measureables would  ...

Under the hood, though, it is just read_html() with a few tags prepended:

function (html, title = "") 
{
  xml2::read_html(paste0("<!doctype html>\n", "<meta charset=utf-8>\n", 
    "<title>", title, "</title>\n", html))
}
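
To apply this to the whole vector from the question, something along these lines should work. It is only a sketch: the wrapper name clean_minimal is hypothetical, and it assumes non-empty input strings. Since minimal_html() pastes its tags onto a single string and builds one document per call, the wrapper is applied to each element separately. It also reads only the body and trims the result, because the prepended tags can leave a leading newline in the text.

```r
library(rvest)

dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

# Hypothetical wrapper (illustrative name) around minimal_html()
clean_minimal <- function(x) {
  doc <- minimal_html(x)
  # Take only the <body> text and trim surrounding whitespace
  trimws(html_text(html_element(doc, "body")))
}

# One document per string, collected into a character vector
vapply(dat, clean_minimal, character(1), USE.NAMES = FALSE)
```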
