I’m brand new to web scraping in R and not very familiar with HTML. I’m trying to scrape data for the top 50 IMDb movies from https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250. I know to use read_html, which gives me an XML document, and that I then need html_nodes to extract the movie titles. But because I’m not very familiar with HTML, I’m struggling to figure out what those nodes are named. Can anyone point me in the right direction?

library(rvest)
library(dplyr)
website <- "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250"
page <- read_html(website)
movie_titles <- page %>%
  html_nodes("node_name_here") %>%
  html_text()

2 Answers


  1. Right-click the webpage and choose Inspect to open the browser's developer tools, then find the node that holds the information you need.

    There are multiple options, but I used the a and h3 tags, along with the h3's class, to get the correct values.

    library(rvest)
    library(dplyr)
    website <- "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_250"
    
    page <- read_html(website)
    movie_titles <- page %>%
      html_nodes("a h3.ipc-title__text") %>%
      html_text() 
    
    movie_titles 
    
    # [1] "1. The Shawshank Redemption"                         
    # [2] "2. 12th Fail"                                        
    # [3] "3. The Godfather"                                    
    # [4] "4. The Dark Knight"                                  
    # [5] "5. The Lord of the Rings: The Return of the King"    
    # [6] "6. Schindler's List"                                 
    # [7] "7. The Godfather Part II"                            
    # [8] "8. 12 Angry Men"                                     
    # [9] "9. The Lord of the Rings: The Fellowship of the Ring"
    #...
    #...
    #[47] "47. American History X"                              
    #[48] "48. The Pianist"                                     
    #[49] "49. Intouchables"                                    
    #[50] "50. Casablanca"
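
    To see what the selector "a h3.ipc-title__text" matches, here is a minimal sketch on a tiny, made-up HTML snippet (the markup is hypothetical and only imitates the shape of the real page). It selects h3 elements with the class ipc-title__text that sit inside an a element, and ignores everything else.

    ```r
    library(rvest)

    # Hypothetical miniature page: one matching title, one h3 that is not
    # inside a link, and one link that has no h3 at all.
    doc <- read_html('
      <html><body>
        <a href="#"><h3 class="ipc-title__text">1. Example Movie</h3></a>
        <h3 class="ipc-title__text">Not inside a link</h3>
        <a href="#">Link without a heading</a>
      </body></html>
    ')

    # "a h3.ipc-title__text": h3 elements with class ipc-title__text
    # that are descendants of an a element.
    titles <- doc %>%
      html_nodes("a h3.ipc-title__text") %>%
      html_text()

    titles
    ```

    Only the first heading should come back, since it is the only one that satisfies both the tag nesting and the class.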
    

    You can then do some data cleaning to strip the leading rank numbers and keep only the movie names.
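
    One possible way to do that cleaning, as a sketch in base R: the sample vector below just reuses a few values in the same "rank. title" shape as the output above, and clean_titles is a name chosen here for illustration. The pattern is anchored to the start of the string so that digits belonging to a title (such as "12 Angry Men") are left alone.

    ```r
    # Sample values in the same "rank. title" shape as the scraped output
    movie_titles <- c("1. The Shawshank Redemption", "2. 12th Fail", "8. 12 Angry Men")

    # Remove only the leading digits, the dot, and any whitespace after it.
    # sub() replaces the first match only, and "^" pins it to the start.
    clean_titles <- sub("^[0-9]+[.][[:space:]]*", "", movie_titles)

    # For example, "2. 12th Fail" becomes "12th Fail"
    clean_titles
    ```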

  2. An alternative approach here would be to use regex.

    library(tidyverse)
    library(rvest)
    
    website <- "https://www.imdb.com/search/title/?sort=user_rating,asc&groups=top_250"
    
    page <- read_html(website)
    

    If we search for the movie titles in the returned page source, we can see that each one occurs after "originalTitleText":" and before the next ". For example: "originalTitleText":"Groundhog Day".

    We can convert the page to a single string, then use a regex based on this pattern to extract the movie titles.

    collapsed_page <- paste(page, collapse = " ")
    
    pattern <- '(?<=\"originalTitleText\":\")[^\"]+'
    
    matches <- regmatches(collapsed_page, gregexpr(pattern, collapsed_page, perl=TRUE))
    

    If you want these movies as a dataframe, you could then do:

    df <- data.frame(match = unlist(matches))
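
    To check the pattern without hitting the website, here is a small offline sketch. The string sample_source is a hypothetical fragment that only imitates the pattern described above; it is not copied from IMDb, and the column name title is chosen here for illustration.

    ```r
    # Hypothetical fragment imitating the embedded data; not real IMDb output
    sample_source <- '{"originalTitleText":"Groundhog Day","x":1},{"originalTitleText":"Example Movie"}'

    # Lookbehind: match text that follows "originalTitleText":" up to the next quote
    pattern <- '(?<="originalTitleText":")[^"]+'

    # regmatches() returns a list with one element per input string,
    # so [[1]] pulls out the character vector of matches
    titles <- regmatches(sample_source, gregexpr(pattern, sample_source, perl = TRUE))[[1]]

    # unique() drops repeats in case the same title is embedded more than once
    df <- data.frame(title = unique(titles))
    df
    ```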
    