Let’s say I’m trying to scrape transcripts like this one. If you scroll down, you’ll see that there is an h2 element that has both the text "Transcript" and an id="transcript" attribute. If I’m not mistaken, the p elements that appear "under" the h2 header are actually its siblings, not its children, which is why neither of the following two solutions works:
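# using rvest (t is assumed to hold the parsed page)

# attempt 1: children of the h2 (it has no element children)
t %>%
  html_elements('#transcript') %>%
  html_children()

# attempt 2: p elements nested inside the h2 (there are none)
t %>%
  html_elements('#transcript p')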
So, how would I get just those p elements?
I tried searching previous SO wisdom, and only found (kind of) similar questions asked by BeautifulSoup users. Nevertheless, this seems like it should be a basic question, so perhaps I’m even more off base than I think I am.
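To make the structure concrete, here is a minimal sketch on a made-up page. The HTML and the names `toy`, `attempt1`, `attempt2`, and `paras` are assumptions for illustration only, not the real transcript page. It reproduces the situation described above, where the p elements follow the h2 as siblings, and shows one possible way to reach them with the CSS general sibling combinator (`~`):

```r
library(rvest)

# Hypothetical stand-in for the real page: the <p> elements come after the
# <h2 id="transcript"> as siblings rather than sitting inside it.
toy <- minimal_html('
  <div>
    <p>Intro paragraph before the heading.</p>
    <h2 id="transcript">Transcript</h2>
    <p>First line of the transcript.</p>
    <p>Second line of the transcript.</p>
  </div>
')

# Attempt 1: the h2 contains only text, so it has no element children.
attempt1 <- toy %>%
  html_elements("#transcript") %>%
  html_children()

# Attempt 2: a space is the descendant combinator, and no p sits inside the h2.
attempt2 <- toy %>%
  html_elements("#transcript p")

# The general sibling combinator (~) matches p elements that come after the
# h2 and share its parent.
paras <- toy %>%
  html_elements("#transcript ~ p") %>%
  html_text2()

length(attempt1)
length(attempt2)
paras
```

On this toy page both original attempts should come back empty, while the sibling selector should return only the two paragraphs after the heading. One caveat: `~` matches every later p sibling, so if the real page has further sections after the transcript, their paragraphs would be picked up as well and would need to be filtered out separately.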
2 Answers
Does this work for you? See comments for an explanation.
With only rvest