Let’s say I’m trying to scrape transcripts like this one. If you scroll down, you’ll see an h2 element that has both the text "Transcript" and an id="transcript" attribute. If I’m not mistaken, the p elements that appear "under" the h2 header are actually its siblings rather than its children, which is why neither of the following two attempts works:

# using rvest

t %>% 
  html_elements('#transcript') %>% 
  html_children()

t %>% 
  html_elements('#transcript p')

So, how would I get just those p elements?

I tried searching previous SO wisdom, and only found (kind of) similar questions asked by BeautifulSoup users. Nevertheless, this seems like it should be a basic question, so perhaps I’m even more off base than I think I am.

2 Answers


  1. Does this work for you? See comments for an explanation.

    library(rvest)
    library(xml2)
    
    #read the page
    url <- "https://80000hours.org/podcast/episodes/kevin-esvelt-stealth-wildfire-pandemics/"
    page <- read_html(url)
    
    #find the h2 elements
    h2_elements <- page %>% html_elements('h2')
    h2_text <- h2_elements %>% html_text()
    
    #select the node with the word "Transcript"
    desired_h2 <- h2_elements[grep("Transcript", h2_text)]
    
    #find the parent node of the desired h2
    parent <- xml_parent(desired_h2)
    
    #find all "p" nodes nested under the parent (this includes the h2's siblings)
    answer <- parent %>% html_elements("p") %>% html_text()
    
    head(answer, 5)
    
    [1] "Table of Contents"                                                                                                                                                                                                                                                                                                                                                            
    [2] "Kevin Esvelt: So scientists correctly appreciate that, when there is controversy, you can get a paper in Nature, Science, or Cell — the top journals which are the best for your career."                                                                                                                                                                                     
    [3] "Therefore, the incentives favour scientists identifying pandemic-capable viruses and determining whether posited cataclysmically destructive viruses and other forms of attack would actually function."                                                                                                                                                                      
    [4] "And I have not seen any appreciable counter-incentives that could be anywhere near as powerful as the ones favouring our desire to know. Because almost all the time, it is better for us to know."                                                                                                                                                                           
    [5] "So I don’t see many plausible futures in which we do not learn how to build agents that would bring down civilisation today. We just know that in the limit, if you get good enough at programming biology, we can do anything t
    
  2. With only rvest

    library(rvest)
    
    page <- "https://80000hours.org/podcast/episodes/kevin-esvelt-stealth-wildfire-pandemics/#transcript" %>% 
      read_html()
    
    page %>% 
      html_elements(".col-md-offset-2.collapse-gradient__target p") %>% 
      html_text2()
    
    [1] "Table of Contents"                                                                                                                                                                                                                                                                                                                                                            
    [2] "Kevin Esvelt: So scientists correctly appreciate that, when there is controversy, you can get a paper in Nature, Science, or Cell — the top journals which are the best for your career."                                                                                                                                                                                     
    [3] "Therefore, the incentives favour scientists identifying pandemic-capable viruses and determining whether posited cataclysmically destructive viruses and other forms of attack would actually function."                                                                                                                                                                      
    [4] "And I have not seen any appreciable counter-incentives that could be anywhere near as powerful as the ones favouring our desire to know. Because almost all the time, it is better for us to know."                                                                                                                                                                           
    [5] "So I don’t see many plausible futures in which we do not learn how to build agents that would bring down civilisation today. We just know that in the limit, if you get good enough at programming biology, we can do anything that nature can do — and nature can do the kind of pathogen that is necessary to kill billions and set back civilisation by at least a century"
    
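Since the question is specifically about siblings, here is a minimal sketch of selecting them directly with the XPath `following-sibling` axis. It runs against a small made-up HTML snippet (built with `minimal_html()`), not the real page; the snippet's layout is an assumption based on the question's description that the h2 and the p elements share a parent.

```r
library(rvest)

# Hypothetical miniature of the page: the h2 and the p elements are siblings
# inside one parent div. The real page's markup may differ.
doc <- minimal_html('
  <div>
    <p>Table of Contents</p>
    <h2 id="transcript">Transcript</h2>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
')

# The h2 has no p children, so the descendant selector from the question is empty
no_match <- doc %>% html_elements("#transcript p")

# following-sibling::p picks the p elements that come after the h2
# under the same parent, which skips the "Table of Contents" paragraph
paragraphs <- doc %>%
  html_elements(xpath = "//h2[@id='transcript']/following-sibling::p") %>%
  html_text2()

paragraphs
```

On the real page this might still pick up paragraphs that follow the transcript within the same parent, so the result is worth checking by eye.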