
There is this site with Telegram chats of neighbours in Moscow:
https://moscow.chatnovosela.ru/novostroyki
I need to scrape it and get links to every card on this site.

The trick is: cards are appended by XHR when the user reaches the bottom of the page, so plain requests can't get them all. Is there a way to load them all at once? I've done my research and found out that I can use Selenium for it somehow. Where do I start?

2 Answers


  1. I guess you need to do something like this (feel free to ask any questions; I don't know about the XHR part, but this code can scrape the card URLs):

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    
    main_url = "https://moscow.chatnovosela.ru/novostroyki"
    driver = webdriver.Chrome(service=Service("<DRIVERPATH>/chromedriver"))
    driver.get(main_url)
    page_source = driver.page_source
    
    # count the cards currently present in the rendered page
    soup = BeautifulSoup(page_source, "html.parser")
    list_items = soup.find_all("div", attrs={"class": "col-md-6 col-lg-4 col-xl-3 m-b-30"})
    
    url_list = []
    
    # grab each card's link by its position inside the #showmore-list container
    for x in range(len(list_items)):
        try:
            xpath = '//*[@id="showmore-list"]/div[' + str(x + 1) + ']/div/a'
            href = driver.find_element(By.XPATH, xpath).get_attribute("href")
            url_list.append({'url': href})
        except Exception as e:
            print(e)
            continue
    
    print(url_list)
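
    Note that this only collects the cards that are present when the page first loads. Since the site appends more cards via XHR as you scroll, a minimal sketch of scrolling to the bottom until the page stops growing (the 2-second pause is an assumption; adjust it to how fast the site responds) before reading page_source might look like this:

    import time
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    driver.get("https://moscow.chatnovosela.ru/novostroyki")
    
    # keep scrolling to the bottom until the page height stops growing,
    # i.e. the XHR calls stop appending new cards
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # assumed pause so the XHR response has time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    
    page_source = driver.page_source  # now contains all of the loaded cards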
    
  2. "I've done my research and found out that I can use Selenium for it somehow"

    No need to use Selenium – it's overkill for this kind of task. Instead you can use simple HTTP requests to emulate the "bottom of the page" load behaviour.

    Just iterate over pages in XHR requests and print found apartment URLs:

    import requests
    from bs4 import BeautifulSoup
    
    HEADERS = {
        'referer': 'https://moscow.chatnovosela.ru/novostroyki',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/111.0.0.0 Safari/537.36',
    }
    
    
    def find_apartment_urls() -> None:
        page_number = 1
        with requests.Session() as sess:
            # fetch the root page once to get the cookies needed for the XHR requests
            _ = sess.get('https://moscow.chatnovosela.ru/novostroyki')
    
            while True:
                # emulate the XHR request the page sends when you reach the bottom
                resp = sess.post(
                    'https://moscow.chatnovosela.ru/service.php',
                    data=dict(
                        type='get_novostroyli_objects',  # a typo in the API here
                        page=page_number,
                        city=3,
                    ),
                    headers=HEADERS,
                )
    
                # extract hrefs from the XHR response; can also be done with a regexp
                soup = BeautifulSoup(resp.text, "lxml")
                apartment_urls = {x.get('href') for x in soup.find_all('a')}
    
                # print results; check whether the end has been reached
                if apartment_urls:
                    print(f'Apartments found on page #{page_number}: '
                          f'{", ".join(apartment_urls)}')
                    page_number += 1
                else:
                    print('Search is finished.')  # no data == last page is reached
                    break
    
    
    if __name__ == '__main__':
        find_apartment_urls()

    Output:

    Apartments found on page 1: https://moscow.chatnovosela.ru/object/lyublinskiy_park_2253, https://moscow.chatnovosela.ru/object/triniti, https://moscow.chatnovosela.ru/object/myakinino_park, https://moscow.chatnovosela.ru/object/kronshtadtskiy_9_2671, https://moscow.chatnovosela.ru/object/life_varshavskaya, https://moscow.chatnovosela.ru/object/d1, https://moscow.chatnovosela.ru/object/green_park_2428, https://moscow.chatnovosela.ru/object/wellton_towers, https://moscow.chatnovosela.ru/object/baltiyskiy, https://moscow.chatnovosela.ru/object/jazz, https://moscow.chatnovosela.ru/object/now_kvartal_na_naberezhnoy, https://moscow.chatnovosela.ru/object/dmitrovskiy_park_2889
    Apartments found on page 2: https://moscow.chatnovosela.ru/object/sheremetevskiy, https://moscow.chatnovosela.ru/object/mihaylovskiy_park, https://moscow.chatnovosela.ru/object/stolichnye_polyany, https://moscow.chatnovosela.ru/object/volzhskiy_park_2554, https://moscow.chatnovosela.ru/object/aquatoria, https://moscow.chatnovosela.ru/object/bolshaya_ochakovskaya_2, https://moscow.chatnovosela.ru/object/river_park_3047, https://moscow.chatnovosela.ru/object/pervyy_moskovskiy, https://moscow.chatnovosela.ru/object/savelovskiy_siti_2064, https://moscow.chatnovosela.ru/object/seliger_siti, https://moscow.chatnovosela.ru/object/salarevo_park, https://moscow.chatnovosela.ru/object/lyubov_i_golubi
    ...
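
    If you want the links gathered in one place instead of printed page by page, a small variation of the same loop (reusing HEADERS and the imports from the snippet above) could accumulate them into a single set:

    def collect_apartment_urls() -> set:
        all_urls = set()
        page_number = 1
        with requests.Session() as sess:
            # the root page is fetched once to obtain the cookies for the XHR requests
            sess.get('https://moscow.chatnovosela.ru/novostroyki')
            while True:
                resp = sess.post(
                    'https://moscow.chatnovosela.ru/service.php',
                    data={'type': 'get_novostroyli_objects', 'page': page_number, 'city': 3},
                    headers=HEADERS,
                )
                soup = BeautifulSoup(resp.text, 'lxml')
                page_urls = {a.get('href') for a in soup.find_all('a')}
                if not page_urls:
                    break  # empty response == last page reached
                all_urls |= page_urls
                page_number += 1
        return all_urls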
    
    