Html - New to web scraping in R - how to use the rvest package to scrape IMDB movie data?

androsrj
January 10, 2024
138 views
0 votes
2 Answers

I’m brand new to web scraping in R and not very familiar with HTML. I’m trying to scrape data for the top 50 IMDB movies at https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250. I know to use read_html, which gives me an XML object, and that I then need html_nodes to extract the movie titles. Because I don’t know HTML well, I’m struggling to work out what those nodes are called. Can anyone point me in the right direction?

library(rvest)
library(dplyr)
website <- "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250"
page <- read_html(website)
movie_titles <- page %>%
  html_nodes("node_name_here") %>%
  html_text()

Tags: html r rvest web-scraping

Answers

Right-click the element you want on the page and choose Inspect to open the browser’s developer tools, then look for the node that holds the information you need.

Several selectors would work here; I used the a and h3 tags together with the h3’s class to get the correct values.
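To see what that selector matches without requesting the live page, here is a minimal sketch on a hand-written HTML fragment. The fragment, the name `demo_titles`, and the "Recently viewed" heading are assumptions made up for illustration; the real IMDB markup is larger and may change. It uses `html_elements()` and `html_text2()`, the names rvest 1.0.0 and later recommend in place of `html_nodes()` and `html_text()`.

```r
library(rvest)

# Hypothetical fragment imitating the structure described above:
# title headings are h3.ipc-title__text elements nested inside an <a>.
fragment <- minimal_html('
  <div><a href="#"><h3 class="ipc-title__text">1. The Shawshank Redemption</h3></a></div>
  <div><a href="#"><h3 class="ipc-title__text">2. 12th Fail</h3></a></div>
  <div><h3 class="ipc-title__text">Recently viewed</h3></div>
')

# "a h3.ipc-title__text" reads: any h3 with class ipc-title__text
# that is a descendant of an <a> element.
demo_titles <- fragment %>%
  html_elements("a h3.ipc-title__text") %>%
  html_text2()

print(demo_titles)
```

The third heading has the same class but is not inside a link, so the descendant selector skips it. That is why combining the tag and class is more precise than using the class alone.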

library(rvest)
library(dplyr)
website <- "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250"

page <- read_html(website)
movie_titles <- page %>%
  html_nodes("a h3.ipc-title__text") %>%
  html_text() 

movie_titles
# [1] "1. The Shawshank Redemption"                         
# [2] "2. 12th Fail"                                        
# [3] "3. The Godfather"                                    
# [4] "4. The Dark Knight"                                  
# [5] "5. The Lord of the Rings: The Return of the King"    
# [6] "6. Schindler's List"                                 
# [7] "7. The Godfather Part II"                            
# [8] "8. 12 Angry Men"                                     
# [9] "9. The Lord of the Rings: The Fellowship of the Ring"
#...
#...
#[47] "47. American History X"                              
#[48] "48. The Pianist"                                     
#[49] "49. Intouchables"                                    
#[50] "50. Casablanca"

You can do some data cleaning to remove numbers and get only movie names from it.
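One way to do that cleaning, as a minimal sketch: strip the leading rank with a regular expression. The vector `ranked_titles` below is a hand-typed sample taken from the output above, and the names `clean_titles`, `ranks`, and `movies` are illustrative assumptions, not part of the original answer.

```r
# Sample of the scraped strings shown above (typed by hand for the example)
ranked_titles <- c(
  "1. The Shawshank Redemption",
  "2. 12th Fail",
  "8. 12 Angry Men",
  "50. Casablanca"
)

# Remove only the leading "<digits>. " prefix; digits that belong to the
# title itself (e.g. "12th Fail", "12 Angry Men") are left untouched
# because the pattern is anchored at the start of the string.
clean_titles <- sub("^[0-9]+\\.[[:space:]]*", "", ranked_titles)

# Keep the rank as its own integer column: drop everything from the first dot on
ranks <- as.integer(sub("\\..*$", "", ranked_titles))

movies <- data.frame(rank = ranks, title = clean_titles)
print(movies)
```

Anchoring the pattern with `^` matters here, since several titles in the list start with a number of their own.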

- JayBee
- January 10, 2024 at 3:42 am
- 0 votes
An alternative approach is to extract the titles with a regular expression instead of CSS selectors.
```
library(tidyverse)
library(rvest)

website <- "https://www.imdb.com/search/title/?sort=user_rating,asc&groups=top_250"

page <- read_html(website)
```
If we search the returned page source for the movie titles, we can see that each one occurs after "originalTitleText":" and before the next double quote ("). For example, "originalTitleText":"Groundhog Day".

We can convert the parsed page to a single string, then use a regex based on the above pattern to extract the movie titles.
```
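# Serialize the parsed page into one long string
collapsed_page <- paste(page, collapse = " ")

# Lookbehind: match whatever follows "originalTitleText":" up to the next double quote
pattern <- '(?<=\"originalTitleText\":\")[^\"]+'

# perl = TRUE is required because lookbehind is a PCRE feature
matches <- regmatches(collapsed_page, gregexpr(pattern, collapsed_page, perl = TRUE))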
```
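To check the pattern without downloading anything, here is a small sketch that runs the same lookbehind regex on a hand-written string. `sample_source` is a hypothetical stand-in for the serialized page; the real source is far longer and its exact layout is an assumption here. The `unique()` call is an addition in case the same title appears more than once in the source.

```r
# Hypothetical stand-in for the serialized page source
sample_source <- paste0(
  '{"id":1,"originalTitleText":"Groundhog Day","year":1993},',
  '{"id":2,"originalTitleText":"Casablanca"},',
  '{"id":1,"originalTitleText":"Groundhog Day"}'
)

# Same idea as above: a fixed-length lookbehind, then everything up to the next quote
pattern <- '(?<="originalTitleText":")[^"]+'

# regmatches() returns a list with one element per input string; take the first
found <- regmatches(sample_source, gregexpr(pattern, sample_source, perl = TRUE))[[1]]

# Drop repeats while keeping the order of first appearance
titles <- unique(found)
print(titles)
```

This approach depends on the title text sitting directly after that key in the page source, so it is worth re-checking the pattern if the site changes how it embeds its data.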
If you want these movies as a dataframe, you could then do:
```
df <- data.frame(match = unlist(matches))
```