
I have a problem with my code. I'm trying to web scrape the top 250 movies on IMDb from this URL: https://www.imdb.com/chart/top/

The problem is that I can only extract 25 movies and I want all 250. This is my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
from requests.exceptions import HTTPError
contenido = None
encabezados = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edge/101.0.1210.53",
    'Accept-Language': 'en-us,en;q=0.5'
}
def rastrear_sitio_web(url: str, headers: dict) -> str:
    try:
        respuesta = requests.get(url, headers=headers)
        respuesta.raise_for_status()
    except HTTPError as exc:
        print(exc)
    else:
        return respuesta.text

URL = 'https://www.imdb.com/chart/top/'
contenido = rastrear_sitio_web(url=URL, headers=encabezados)
pagina = BeautifulSoup(contenido, 'html.parser')
contenido_extraido = []
año = []
ranking = []
titulo = []
nota = []
tiempo = []
rated = []

tabla = pagina.find('div', {'data-testid': 'chart-layout-main-column'})

peliculas = tabla.find("ul")

for pelicula in peliculas.find_all('li'):
    # Split the text of each list item into its fields
    pelicula = pelicula.get_text(";").strip().split(";")
    # pelicula[0] looks like "1. Title"; split only on the first period
    # so titles that contain periods stay whole
    numero, nombre = pelicula[0].split(".", 1)
    año.append(pelicula[1])
    ranking.append(numero)
    titulo.append(nombre.strip())
    nota.append(pelicula[4])
    tiempo.append(pelicula[2])
    rated.append(pelicula[3])
datos = {'Ranking': ranking, 'Título': titulo, 'Año': año, 'Calificación': nota, 'Duracion':tiempo, 'Rated': rated}
print(datos)
contenido_extraido = pd.DataFrame(data=datos)

I tried changing functions and changing the classes in the HTML code, but it doesn't work. I also tried different code, but it has the same problem.

2 Answers


  1. Someone already answered why you aren't getting all of the results, but you do not need to add a delay. The information lives in a script tag of type application/ld+json. Just scrape that with Beautiful Soup as shown below.
    If you print out the len() you will see 250 items.

    import requests
    from bs4 import BeautifulSoup
    import json
    
    headers = {'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/70.0.3538.110 Safari/537.36', 'Accept-Language':'en-US;q=0.5,en;q=0.3', 'Cache-Control': 'max-age=0', 'Upgrade-Insecure-Requests': '1'}
    
    response = requests.get('https://www.imdb.com/chart/top/', headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    data = json.loads(soup.find('script', {'type':"application/ld+json"}).text)
    
    print(len(data['itemListElement']))
    for item in data['itemListElement'][:5]:
        print(item['item']['name'])
    

    You will need additional code to build a DataFrame, but you can get all of the movie information by iterating through the data['itemListElement'] list.
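
As a rough sketch of that extra step, the parsed JSON can be flattened into one dict per movie. Only `item['name']` is confirmed by the snippet above; the other field names (`aggregateRating`, `ratingValue`, `duration`, `contentRating`) are assumptions about what the JSON contains, so they are read with `.get()` and come back as None when absent. The helper name `items_to_rows` and the sample data are hypothetical, and the ranking assumes the list is already in chart order.

```python
def items_to_rows(data):
    """Flatten the parsed JSON dict into one plain dict per movie."""
    rows = []
    # Assumption: the list order matches the chart order, so the index is the rank
    for rank, element in enumerate(data.get("itemListElement", []), start=1):
        item = element.get("item", {})
        # Assumed field; fall back to an empty dict if it is missing
        rating = item.get("aggregateRating") or {}
        rows.append({
            "Ranking": rank,
            "Title": item.get("name"),
            "Rating": rating.get("ratingValue"),    # assumed field name
            "Duration": item.get("duration"),       # assumed field name
            "Rated": item.get("contentRating"),     # assumed field name
        })
    return rows


# Hypothetical sample shaped like the parsed JSON; the values are made up
sample = {
    "itemListElement": [
        {"item": {"name": "Movie A",
                  "aggregateRating": {"ratingValue": 9.1},
                  "duration": "PT2H",
                  "contentRating": "R"}},
        {"item": {"name": "Movie B"}},
    ]
}

rows = items_to_rows(sample)
print(rows[0]["Title"], rows[1]["Rating"])
```

With the real response, `pd.DataFrame(items_to_rows(data))` would turn the rows into a DataFrame, since pandas accepts a list of dicts.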

    Simpler is almost always better.

  2. IMDb does not load all 250 movies on its initial request. It loads the first 25 and then makes further requests to load the remaining 225. In my browser I paused the page as it was loading, and you can see it only has 25 movies at first; a few seconds later it loads the rest of them:

    [Screenshot: the web page has only 25 movies on initial load]

    So your script is capturing the page contents before it’s done loading the full list. Perhaps you can add a brief delay before scanning the ul contents. Or, since you know you want 250 results, you could retry the ul scan a few times (maybe 1 second apart) until it returns 250 results.
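
The retry idea could look roughly like the sketch below. `fetch_until_complete` and its `fetch_items` argument are hypothetical names: `fetch_items` stands for whatever function re-reads the list from a page where the scripts have actually run (for example, a browser automation tool). Repeating a plain `requests.get` would keep returning the same initial HTML, because requests does not execute JavaScript. The demo uses a fake fetcher so the example runs without network access.

```python
import time


def fetch_until_complete(fetch_items, expected=250, attempts=5, delay=1.0):
    """Call fetch_items() until it returns at least `expected` items
    or the attempts run out; return the last result either way."""
    items = []
    for attempt in range(attempts):
        items = fetch_items()
        if len(items) >= expected:
            break
        # Wait before the next attempt, but not after the final one
        if attempt < attempts - 1:
            time.sleep(delay)
    return items


# Fake fetcher for illustration: returns 25 items twice, then all 250
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    return list(range(25 if calls["n"] < 3 else 250))


result = fetch_until_complete(fake_fetch, delay=0.0)
print(len(result), calls["n"])
```

Capping the number of attempts keeps the script from looping forever if the page never reaches 250 entries.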
