I have a problem with my code. I'm trying to scrape the top 250 movies on IMDb from this URL: https://www.imdb.com/chart/top/
The problem is that I can only extract 25 movies, and I want all 250. This is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import time
import re
from requests.exceptions import HTTPError
from urllib.request import urlopen
contenido = None
encabezados = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edge/101.0.1210.53",
    'Accept-Language': 'en-us,en;q=0.5'
}
def rastrear_sitio_web(url: str, headers: str) -> str:
    try:
        respuesta = requests.get(url, headers=headers)
        respuesta.raise_for_status()
    except HTTPError as exc:
        print(exc)
    else:
        return respuesta.text
URL = 'https://www.imdb.com/chart/top/'
contenido = rastrear_sitio_web(url=URL, headers=encabezados)
pagina = BeautifulSoup(contenido, 'html.parser')
contenido_extraido = []
año = [""]
ranking = [""]
titulo = [""]
nota = [""]
tiempo = [""]
rated = [""]
tabla = pagina.find('div', {'data-testid': 'chart-layout-main-column'})
peliculas = tabla.find("ul")
for pelicula in peliculas.find_all('li'):
    pelicula = pelicula.get_text(";").strip().split(";")
    año.append(pelicula[1])
    ranking.append(pelicula[0].split(".")[0])
    titulo.append(pelicula[0].split(".")[1])
    nota.append(pelicula[4])
    tiempo.append(pelicula[2])
    rated.append(pelicula[3])
año.pop(0)
ranking.pop(0)
titulo.pop(0)
nota.pop(0)
tiempo.pop(0)
rated.pop(0)
datos = {'Ranking': ranking, 'Título': titulo, 'Año': año, 'Calificación': nota, 'Duracion':tiempo, 'Rated': rated}
print(datos)
contenido_extraido = pd.DataFrame(data=datos)
I tried changing functions and changing the classes in the HTML code, but it doesn't work. I also tried different code, but it has the same problem.
2 Answers
As the other answer says, you aren't getting all of the results, but you do not need to add a delay. The information lives in a script tag. Just scrape that with Beautiful Soup.
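A minimal sketch of that approach. It assumes the chart data sits in a script tag of type application/ld+json and that the JSON has an itemListElement key; the helper name extract_item_list and the tiny SAMPLE_HTML page are made up here so the example runs without hitting IMDb.

```python
import json

from bs4 import BeautifulSoup


def extract_item_list(html: str) -> list:
    """Return the itemListElement entries from the page's JSON-LD script tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the chart data lives in <script type="application/ld+json">.
    script = soup.find("script", {"type": "application/ld+json"})
    if script is None or not script.string:
        return []
    data = json.loads(script.string)
    # Assumption: the movies are listed under the "itemListElement" key.
    return data.get("itemListElement", [])


# Tiny stand-in for the real page (hypothetical content) so this runs offline.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "item": {"@type": "Movie", "name": "Movie A"}},
  {"@type": "ListItem", "item": {"@type": "Movie", "name": "Movie B"}}
]}
</script>
</head><body></body></html>
"""

items = extract_item_list(SAMPLE_HTML)
print(len(items))
```

With the code from the question, you would pass contenido (the text returned by rastrear_sitio_web) instead of SAMPLE_HTML.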
If you print out the len() you will see 250 items.
Add additional code to make a dataframe, but you can get all of the movie information by iterating through the data['itemListElement'] list.
Simple is almost always better.
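Something like this could flatten that list into rows, one dict per movie. The field names (item, name, aggregateRating, ratingValue, duration, contentRating) are my assumption about what each entry looks like, so check them against the real JSON; to_rows and the sample entries are hypothetical. The ranking is taken from the position in the list, which assumes the list is in chart order.

```python
def to_rows(item_list: list) -> list:
    """Flatten itemListElement entries into one dict per movie."""
    rows = []
    for rank, entry in enumerate(item_list, start=1):
        # Assumption: each entry wraps the movie under an "item" key.
        movie = entry.get("item", {})
        rating = movie.get("aggregateRating") or {}
        rows.append({
            "Ranking": rank,  # position in the list, assumed to be chart order
            "Título": movie.get("name"),
            "Calificación": rating.get("ratingValue"),
            "Duracion": movie.get("duration"),
            "Rated": movie.get("contentRating"),
        })
    return rows


# Hypothetical entries shaped like the assumed JSON, for an offline run.
sample = [
    {"item": {"name": "Movie A", "aggregateRating": {"ratingValue": 9.0},
              "duration": "PT2H", "contentRating": "R"}},
    {"item": {"name": "Movie B"}},  # missing fields simply come back as None
]

rows = to_rows(sample)
print(rows[0])
```

From there pd.DataFrame(rows) gives you the dataframe, with the same column names as in the question (minus the year, which I left out because I am not sure the JSON carries it).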
IMDb does not load all 250 movies on its initial request. It loads the first 25 and then makes further requests to load the remaining 225. In my browser I paused the page as it was loading, and you can see it only has 25 movies at first, and then a few seconds later it loads the rest of them.
So your script is capturing the page contents before it's done loading the full list. Perhaps you can add a brief delay before scanning the ul contents. Or, since you know you want 250 results, you could retry the ul scan a few times (maybe 1 second apart) until it returns 250 results.