Is there a way to scrape a page with XHR autoload? - Telegram API

tuf9k
March 28, 2023
253 views
0 votes
2 Answers

There is a site with Telegram chats of neighbours in Moscow:
https://moscow.chatnovosela.ru/novostroyki
I need to scrape it and get the link to every card on the site.

The catch is that cards are appended by XHR when the user reaches the bottom of the page, so a single requests call can't get them all. Is there a way to load them all at once? I've done my research and found out that I can use Selenium for it somehow. Where do I start?

Answers

I guess you need something like this. Feel free to ask any questions. I don't know about XHR, but this code can scrape the card URLs:

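from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

main_url = "https://moscow.chatnovosela.ru/novostroyki"

# Selenium 4 takes the driver path through a Service object; the
# executable_path argument of webdriver.Chrome is deprecated and
# has been removed in newer releases.
driver = webdriver.Chrome(service=Service("<DRIVERPATH>/chromedriver"))
driver.get(main_url)

# Count the cards that are in the DOM right after the initial load.
soup = BeautifulSoup(driver.page_source, 'html.parser')
list_items = soup.find_all("div", attrs={"class": "col-md-6 col-lg-4 col-xl-3 m-b-30"})

url_list = []

# XPath positions are 1-based, so start counting from 1.
for position in range(1, len(list_items) + 1):
    xpath = f'//*[@id="showmore-list"]/div[{position}]/div/a'
    try:
        href = driver.find_element(By.XPATH, xpath).get_attribute("href")
    except NoSuchElementException as e:
        print(e)
        continue
    url_list.append({'url': href})

driver.quit()
print(url_list)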
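
The code above reads the page right after the first load, so it only sees the cards that are already in the DOM. A common way to pull in the rest is to keep scrolling to the bottom until the page height stops growing. The sketch below is only that: a rough illustration, assuming the site really loads more cards on scroll (as the question describes) and that a fixed pause is long enough for each XHR to finish. The helper name `scroll_until_stable` and its parameters are made up for this example.

```python
import time


def scroll_until_stable(driver, pause=1.5, max_rounds=200):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with an `execute_script` method, for example a
    Selenium WebDriver. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the XHR to append new cards
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was appended, assume the end is reached
        last_height = new_height
    return max_rounds  # safety cap so an endless feed cannot loop forever
```

It would be called as `scroll_until_stable(driver)` between `driver.get(main_url)` and reading `driver.page_source`. If the page turns out to use a "show more" button rather than scroll events, the scroll call would need to be swapped for a click on that button.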

"I’ve done my research and found out that I can use Selenium for it somehow"

No need to use Selenium: it’s overkill for this kind of task. Instead, you can use simple HTTP requests to emulate the "bottom of the page" load behaviour.

Just iterate over the pages with XHR requests and print the apartment URLs found:

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'referer': 'https://moscow.chatnovosela.ru/novostroyki',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/111.0.0.0 Safari/537.36',
}


def find_apartment_urls() -> None:
    # one session for the whole run, so the cookies are fetched once and reused
    with requests.Session() as sess:
        # get a root page to get and save cookies necessary for other requests
        sess.get('https://moscow.chatnovosela.ru/novostroyki', timeout=30)

        page_number = 1
        while True:
            # prepare an XHR request
            resp = sess.post(
                'https://moscow.chatnovosela.ru/service.php',
                data=dict(
                    type='get_novostroyli_objects',  # sic: the typo is in the site's API
                    page=page_number,
                    city=3,
                ),
                headers=HEADERS,
                timeout=30,
            )

            # extract hrefs from XHR response; can also be done with regexp
            # (anchors without an href are skipped so join() never sees None)
            soup = BeautifulSoup(resp.text, "lxml")
            apartment_urls = {a['href'] for a in soup.find_all('a', href=True)}

            # no data == last page is reached
            if not apartment_urls:
                print('Search is finished.')
                break

            print(f'Apartments found on page {page_number}: '
                  f'{", ".join(apartment_urls)}')
            page_number += 1


if __name__ == '__main__':
    find_apartment_urls()

Output:

Apartments found on page 1: https://moscow.chatnovosela.ru/object/lyublinskiy_park_2253, https://moscow.chatnovosela.ru/object/triniti, https://moscow.chatnovosela.ru/object/myakinino_park, https://moscow.chatnovosela.ru/object/kronshtadtskiy_9_2671, https://moscow.chatnovosela.ru/object/life_varshavskaya, https://moscow.chatnovosela.ru/object/d1, https://moscow.chatnovosela.ru/object/green_park_2428, https://moscow.chatnovosela.ru/object/wellton_towers, https://moscow.chatnovosela.ru/object/baltiyskiy, https://moscow.chatnovosela.ru/object/jazz, https://moscow.chatnovosela.ru/object/now_kvartal_na_naberezhnoy, https://moscow.chatnovosela.ru/object/dmitrovskiy_park_2889
Apartments found on page 2: https://moscow.chatnovosela.ru/object/sheremetevskiy, https://moscow.chatnovosela.ru/object/mihaylovskiy_park, https://moscow.chatnovosela.ru/object/stolichnye_polyany, https://moscow.chatnovosela.ru/object/volzhskiy_park_2554, https://moscow.chatnovosela.ru/object/aquatoria, https://moscow.chatnovosela.ru/object/bolshaya_ochakovskaya_2, https://moscow.chatnovosela.ru/object/river_park_3047, https://moscow.chatnovosela.ru/object/pervyy_moskovskiy, https://moscow.chatnovosela.ru/object/savelovskiy_siti_2064, https://moscow.chatnovosela.ru/object/seliger_siti, https://moscow.chatnovosela.ru/object/salarevo_park, https://moscow.chatnovosela.ru/object/lyubov_i_golubi
...
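
Since the question asks for the links themselves rather than printed output, the same "keep requesting pages until one comes back empty" loop can also return a list. The sketch below is one possible way to do that, not the answer's original code: it swaps BeautifulSoup for the standard library's `html.parser`, and the names `extract_hrefs`, `collect_all_urls`, `fetch_page` and `max_pages` are hypothetical. `fetch_page` is assumed to be any callable that takes a page number and returns that page's HTML as a string.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors with a missing or empty href
                self.hrefs.append(href)


def extract_hrefs(html):
    """Return all non-empty <a href> values found in an HTML fragment."""
    parser = _HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def collect_all_urls(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page has no links.

    Returns the unique URLs in the order they were first seen.
    """
    seen = {}
    for page_number in range(1, max_pages + 1):
        hrefs = extract_hrefs(fetch_page(page_number))
        if not hrefs:
            break  # an empty page is treated as the end of the listing
        for href in hrefs:
            seen.setdefault(href, None)  # dicts keep insertion order
    return list(seen)
```

To use it against the site, `fetch_page` could wrap the `sess.post(...)` call from the answer and return `resp.text`. Keeping the network call outside the loop also makes the pagination logic easy to check with canned HTML.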
