I’m trying to get the main headline from this web page.
Here is my code:
import requests
from bs4 import BeautifulSoup

news = ""
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0"
}
url = "https://elperuano.pe/"
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
# Get the main headline
for div in soup.findAll('span', attrs={'class':'card-title fz18 lh30 fw500 width100'}):
    print(div.text)
This is the only span tag that has the class name "card-title fz18 lh30 fw500 width100". I don’t know why this doesn’t work.
However, if I try to get the date of the newspaper, this works:
for div in soup.findAll('div', attrs={'class':'lh18'}):
    n = div.text.rstrip("\n\n")
I have tried many ways to get this, but it seems that the web page is blocking it. If anyone has an idea how to fix this, I would appreciate it. Thanks so much.
2 Answers
Hi, I tried fetching the full page with BeautifulSoup and found that I only get a skeleton version of the page, as you can see in the image. I copied all the headers from my browser and the result was the same, so I’m guessing it’s because the scraper is missing some browser functionality: no cookies, no JavaScript, a different screen size, etc. You can try adding some of these to your request (cookies, for example), or use a headless browser.
It does not work because the page content is loaded dynamically from their API. When you fetch the site with requests and parse it with bs4, the content is not there yet, so the search for the span finds nothing and nothing is printed.
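One way to check this yourself is to look for the span in the raw HTML string before any JavaScript has run. The sketch below is only an illustration: `SpanTextCollector` and `headline_texts` are hypothetical names, it uses the standard library's `html.parser` so it runs on any HTML string without extra installs, and it assumes `card-title` is the class token that identifies the headline (taken from the question).

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose class list contains a token."""

    def __init__(self, class_token):
        super().__init__()
        self.class_token = class_token
        self.depth = 0      # > 0 while we are inside a matching <span>
        self.matches = []   # one text entry per matching outer <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # A nested <span> inside a match: track it so end tags pair up.
            self.depth += 1
        elif self.class_token in classes:
            self.matches.append("")
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def headline_texts(html, class_token="card-title"):
    """Return the stripped text of matching spans found in an HTML string."""
    parser = SpanTextCollector(class_token)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.matches if text.strip()]
```

Calling `headline_texts(requests.get(url, headers=headers).text)` with the `url` and `headers` from the question and getting an empty list back would suggest that the headline is injected after the initial response. With BeautifulSoup, a comparable check is `soup.select("span.card-title")`, which matches on a single class token instead of the exact class string.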
If you check the network log in your browser’s developer tools, you can see which endpoints the data is fetched from.
You can either scrape the site with a library like Selenium, which lets you read the page after its contents have loaded, or get the data you need directly from their API.
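The Selenium route could look roughly like the sketch below. It is a minimal example under assumptions, not a tested scraper for this site: `clean_texts` and `fetch_rendered_texts` are hypothetical helper names, the `span.card-title` selector is assumed from the class name in the question, and it expects Selenium 4 with Firefox available locally.

```python
def clean_texts(texts):
    """Strip surrounding whitespace and drop empty strings, keeping order."""
    return [t.strip() for t in texts if t and t.strip()]


def fetch_rendered_texts(url, css_selector="span.card-title", timeout=15):
    """Load the page in headless Firefox and return the text of matching elements.

    Raises selenium's TimeoutException if nothing matches within `timeout` seconds.
    """
    # Imported here so clean_texts stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait until the JavaScript-rendered elements are present in the DOM.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return clean_texts(el.text for el in elements)
    finally:
        driver.quit()
```

You would call it as `fetch_rendered_texts("https://elperuano.pe/")`. If the selector turns out to be wrong for the rendered page, the wait times out, which at least tells you the element never appeared.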