I’m brand new to web scraping in R and not very familiar with HTML. I’m trying to scrape data for the top 50 IMDb movies at https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250. I know to use read_html, which gives me an XML object, and I know I then need to use html_nodes to extract the movie titles. But because I’m not very familiar with HTML, I’m struggling to figure out what those nodes are named. Can anyone point me in the right direction?
library(rvest)
library(dplyr)
website <- "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250"
page <- read_html(website)
movie_titles <- page %>%
  html_nodes("node_name_here") %>%
  html_text()
2 Answers
Right-click on the webpage and choose the Inspect option to find the correct node, then look for the node that holds the information you need.
There are multiple options, but I used the <a> and <h3> tags together with their classes to get the correct values. You can then do some data cleaning to remove the numbers and keep only the movie names.
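As a rough sketch of that approach, the snippet below runs the selector-based extraction on a small inline HTML sample instead of the live page, so it works offline. The class name ipc-title__text and the sample markup are assumptions for illustration only; use the Inspect panel to find the names the page actually uses, since they can change.

```r
library(rvest)

# Stand-in HTML for illustration; the real page's markup and class names may differ.
sample_html <- '<html><body>
  <div><h3 class="ipc-title__text">1. The Shawshank Redemption</h3></div>
  <div><h3 class="ipc-title__text">2. The Godfather</h3></div>
</body></html>'

page <- read_html(sample_html)

# Select every <h3> carrying the (assumed) class and pull out its text.
raw_titles <- page %>%
  html_nodes("h3.ipc-title__text") %>%
  html_text(trim = TRUE)

# Data cleaning: drop the leading rank number, e.g. "1. ".
movie_titles <- sub("^[0-9]+\\.[[:space:]]*", "", raw_titles)
movie_titles
```

For the live site, replace sample_html with the URL from the question and adjust the selector to whatever Inspect shows.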
An alternative approach here would be to use regex.
If we search for the movie titles in this returned list, we can see that they always occur after "originalTitleText":" and before another double quote. For example, "originalTitleText":"Groundhog Day. We can convert the list to a string, then use regex based on the above pattern to extract the movie titles.
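A minimal sketch of the regex idea, using base R only. The page_list object is a made-up stand-in for the scraped content, and the pattern assumes each title ends at the next double quote, so titles containing escaped quotes would need extra handling.

```r
# Hypothetical stand-in for the list returned from the page.
page_list <- list(
  '{"originalTitleText":"Groundhog Day","year":1993}',
  '{"originalTitleText":"The Godfather","year":1972}'
)

# Convert the list to a single string.
page_string <- paste(unlist(page_list), collapse = " ")

# Match whatever follows "originalTitleText":" up to the next double quote.
# The lookbehind requires perl = TRUE.
pattern <- '(?<="originalTitleText":")[^"]+'
movie_titles <- regmatches(
  page_string,
  gregexpr(pattern, page_string, perl = TRUE)
)[[1]]
movie_titles
```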
If you want these movies as a dataframe, you could then do:
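One simple way to do that, assuming movie_titles is the character vector produced by the extraction step (a two-element stand-in is used here so the snippet runs on its own; the column names are arbitrary):

```r
# Stand-in for the extracted titles.
movie_titles <- c("Groundhog Day", "The Godfather")

# One row per movie, with its position in the scraped list.
movies <- data.frame(
  rank = seq_along(movie_titles),
  title = movie_titles,
  stringsAsFactors = FALSE
)
movies
```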