Why is HTML Loop only returning data from one page instead of multiple pages?

AkhileshDesai
May 19, 2023
211 views
0 votes
2 Answers

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}

questionlist = []

url = "https://seekingalpha.com/market-news?page=20"

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

questions = soup.find_all('article', {'class': 'mT-jA ga-jA Q-b8 R-cS R-df ks-IX R-cG R-dJ ks-IX R-cG R-dJ ks-I0 ks-I0 mT-NM'})

for page in range(1, 10):
    for item in questions:
        question = {
        'title': item.find('h3', {'class': 'km-X R-cw Q-cs km-IM V-gT V-g9 V-hj km-IO V-hY V-ib V-ip km-II R-fZ'}).text,
        'link': 'https://seekingalpha.com/market-news' + item.find('a', {'class': 'hq-ox R-fu'})['href'],
        'date': item.find('span', {'class': 'mU-uO mU-gE'}),
        }
        questionlist.append(question)
    
print(questionlist)

Why is my loop not working? I am scraping multiple pages, but the output is the same single page repeated multiple times.

Answers

The pagination is implemented with a JavaScript request to a separate API URL, so BeautifulSoup never sees the new pages in the HTML that requests downloads. You can simulate that request directly, for example:

import requests

api_url = "https://seekingalpha.com/api/v3/news"

params = {
    "filter[category]": "market-news::all",
    "filter[since]": "0",
    "filter[until]": "0",
    "include": "author,primaryTickers,secondaryTickers",
    "isMounting": "true",
    "page[size]": 25,
    "page[number]": 22,
}

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
    'Referer': 'https://seekingalpha.com/market-news?page=22'
}

with requests.Session() as s:
    s.headers.update(headers)
    # visit the listing page first so the session picks up the site's cookies
    s.get('https://seekingalpha.com/market-news')

    for p in range(1, 5):  # <-- increase this range for more pages
        params['page[number]'] = p

        data = s.get(api_url, params=params).json()
        # print sample data
        for d in data["data"]:
            print(d["attributes"]["title"])

Prints:


...

Thermo Fisher tests for preeclampsia risk gets FDA nod
QCR Holdings declares $0.06 dividend
RingCentral repurchases ~$461M senior notes
Amphenol slips as Credit Suisse downgrades on 'weakness' in certain markets
Innovid receives NYSE notice on non-compliance
Investors were net buyers of fund assets for the fourth consecutive week, adding $4.6B
4 stocks to watch on Friday: Deere, Applied Materials and more
TIO, PHIO and ALIM among pre-market losers

...
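If you want the results in a structure rather than printed, the loop above can feed a small helper. This is only a sketch: collect_titles is a hypothetical name, and it assumes nothing about the response beyond the data -> attributes -> title shape that the loop above already reads.

```python
from typing import Any, Iterable


def collect_titles(payloads: Iterable[tuple]) -> list:
    """Flatten (page_number, json_payload) pairs into one list of rows.

    Hypothetical helper: it relies only on the
    payload["data"][i]["attributes"]["title"] shape used in the loop above.
    """
    rows: list = []
    for page_number, payload in payloads:
        # A payload without a "data" key contributes no rows.
        for item in payload.get("data", []):
            title: Any = item["attributes"]["title"]
            rows.append({"page": page_number, "title": title})
    return rows
```

Inside the session loop you could append (p, data) to a list on each iteration, call collect_titles on that list afterwards, and pass the resulting rows to pd.DataFrame if you need a table.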

- Monco
- May 19, 2023 at 11:09 pm
- 0 votes
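As has been said, your loop needs a way to actually request a different web page on each iteration. In your code the request is made once, before the loop, and the page variable is never used, so the same parsed page is processed nine times. Options include keeping all the links in a list, updating the URL on each iteration when only the page number changes, or having your code press a button to go to the next page.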
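The second option, updating the URL on each iteration, can be sketched roughly as follows. The names page_urls and scrape_pages are made up for illustration, and fetch and parse are placeholders you supply yourself (for example a wrapper around requests.get and your BeautifulSoup extraction). Keep in mind that, per the other answer, this particular site loads its pages through a JavaScript API, so the pattern only helps where the page number really changes the returned HTML.

```python
from typing import Callable, Iterable, List

# Listing URL taken from the question; the ?page=N pattern is also from the question.
BASE_URL = "https://seekingalpha.com/market-news"


def page_urls(first: int, last: int) -> List[str]:
    """Build one URL per page number, inclusive of both ends."""
    return [f"{BASE_URL}?page={n}" for n in range(first, last + 1)]


def scrape_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], list],
) -> list:
    """Fetch and parse every URL, collecting all parsed items.

    The key point: fetch() is called inside the loop, once per URL,
    instead of once before the loop.
    """
    items: list = []
    for url in urls:
        html = fetch(url)          # a new request for each page
        items.extend(parse(html))  # parse that page's own HTML
    return items
```

With requests, fetch could be something like lambda url: requests.get(url, headers=headers).text, and parse would hold the find_all logic from the question.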

I recommend going through the following link to learn more, specifically the part titled "How to Scrape Multiple Web Pages":
Web Scraping tutorial
