Extracting <h2> title text from HTML where the title text might include newlines

pluke
January 8, 2024

I have an HTML file with some <h2> tags, such as:

```
library(stringr)

a <- '<section id="sec-standard-stoet-geary" class="level2" data-number="9.4">
      <h2 data-number="9.4" class="anchored" data-anchor-id="sec-standard-stoet-geary">
      <span class="header-section-number">9.4</span> Standardising PISA results</h2>'

b <- '<span class="fu">read_parquet</span>(<span 
     class="st">"&lt;folder&gt;PISA_2015_student_subset.parquet"</span>)</span></code><button 
     title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre> 
     </div>
     </div>
     </section><section id="sec-leftjoin" class="level2" data-number="9.3"><h2 data-number="9.3" 
     class="anchored" data-anchor-id="sec-leftjoin">
     <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
     </h2>
     <p>some text</p>'

c <- paste(a,b,a)
```

I can extract the title from a using:

```
str_extract_all(a, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
#> [1] "Standardising PISA results"
```

But trying this on b returns nothing:

```
str_extract_all(b, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
#> character(0)
```

And on c it returns only the first and third h2 titles, when it should return all three:

```
str_extract_all(c, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
#> [1] "Standardising PISA results" "Standardising PISA results"
```

EDIT: from the comments, this appears to be caused by the . in the regex not matching newline characters.

I’ve tried enabling single-line mode with (?s), but it still doesn’t return the right titles.
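
For reference, a minimal sketch of why (?s) alone may not be enough, and of one regex-only workaround. By default . does not match a newline. With dotall enabled, a greedy .* can then run from the first </span> all the way to the last </h, swallowing unrelated markup. The sketch below instead matches each <h2>...</h2> lazily and strips tags afterwards. The helper name extract_h2_regex and the sample string are made up for illustration, and this is not a general HTML parser.

```r
library(stringr)

# Hypothetical helper: pull the text of every <h2>, ignoring the numbering <span>.
extract_h2_regex <- function(x) {
  # dotall = TRUE lets "." match newlines; ".*?" is lazy, so each match stops
  # at the nearest </h2> instead of running on to the last one.
  h2_pattern <- regex("<h2\\b[^>]*>(.*?)</h2>", dotall = TRUE)
  inner <- str_match_all(x, h2_pattern)[[1]][, 2]
  # Drop the first <span>...</span> (the section number), then any other tags.
  inner <- str_remove(inner, regex("<span\\b[^>]*>.*?</span>", dotall = TRUE))
  str_squish(str_remove_all(inner, "<[^>]+>"))
}

# Made-up sample with a newline inside the first heading
html <- '<h2 class="anchored">
  <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
  </h2><p>some text</p><h2><span>9.4</span> Standardising PISA results</h2>'

extract_h2_regex(html)
#> [1] "Linking data using left_join" "Standardising PISA results"
```

This assumes headings are never nested and that the section number is the first span in each heading; markup that breaks either assumption needs a real parser.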

Tags: extract html r text

Answers

- AllanCameron
- January 8, 2024 at 10:59 pm
I would use an HTML parser instead of regex here:
```
library(rvest)

read_html(a) |> html_elements("h2") |> html_text() |> trimws()
#> [1] "9.4 Standardising PISA results"

read_html(b) |> html_elements("h2") |> html_text() |> trimws()
#> [1] "9.3 Linking data using left_join"
```
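
If the leading section number is not wanted, one possible follow-up is to strip it from the parsed text. This is a sketch only: the sample HTML is made up, and the pattern assumes the number consists solely of digits and dots followed by whitespace.

```r
library(rvest)

# Made-up sample with two numbered headings, one spanning several lines
html <- '<section><h2 class="anchored">
  <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
  </h2><p>some text</p></section>
  <section><h2><span class="header-section-number">9.4</span> Standardising PISA results</h2></section>'

titles <- read_html(html) |> html_elements("h2") |> html_text() |> trimws()

# Drop a leading section number such as "9.3 " (digits and dots, then whitespace)
clean_titles <- sub("^[0-9.]+\\s+", "", titles)
clean_titles
#> [1] "Linking data using left_join" "Standardising PISA results"
```

Because html_elements() returns every matching node, the same pipeline handles input with several h2 tags.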

Here’s a helper function that selects the H2 elements that contain a span, then drops the spans from the extracted text:

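```
library(xml2)
library(stringr)  # stringr re-exports %>%

geth2 <- function(x) {
  # Select only the <h2> elements that have a <span> child
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  # Remove the spans in place so their text is excluded
  xml_remove(xml_find_all(temp, ".//span"))
  temp %>% xml_text() %>% str_squish()
}

geth2(a)
# [1] "Standardising PISA results"
geth2(b)
# [1] "Linking data using left_join"
```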
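
To illustrate what the //h2[span] predicate selects, here is a small sketch on a made-up fragment: the predicate keeps only headings that have a span child, so a heading without a section number is skipped.

```r
library(xml2)

# Made-up sample: one numbered heading and one plain heading
doc <- read_html("<h2><span>1.1</span> Numbered heading</h2><h2>Plain heading</h2>")

with_span <- xml_text(xml_find_all(doc, "//h2[span]"))  # only headings with a <span> child
all_h2    <- xml_text(xml_find_all(doc, "//h2"))        # every heading

with_span
#> [1] "1.1 Numbered heading"
all_h2
#> [1] "1.1 Numbered heading" "Plain heading"
```

If unnumbered headings should be included as well, the plain //h2 path is probably the better choice.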

If you want to keep the markup inside the H2, this could work:

```
geth2 <- function(x) {
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  xml_remove(xml_find_all(temp, ".//span"))
  temp %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish()
}
geth2(a)
# [1] "Standardising PISA results"
geth2(b)
# [1] "Linking data using <code>left_join</code>"
```

For a version that works with multiple H2 tags, you can use:

```
geth2 <- function(x) {
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  xml_remove(xml_find_all(temp, ".//span"))
  cleanup <- . %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish()
  sapply(temp, cleanup)
}
geth2(c)
# [1] "Standardising PISA results"
# [2] "Linking data using <code>left_join</code>"
# [3] "Standardising PISA results"
```
