
Please have a look at this page: https://maroof.sa/businesses

It is the website I want to extract links from.

For example, if you scroll down you will find a card for a store named "Marwa store".
Clicking that card redirects you to the store's page.

I need to scrape the links of all the stores listed on https://maroof.sa/businesses

After inspecting the page, I found that the links are hidden.

I have successfully extracted the store names,
but I can't find the links.

Thanks in advance.

import csv
import time

from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome()
driver.get(url="https://maroof.sa/businesses")
html = driver.page_source
# Each store card is a div.storeCard; this is where I get the store names from.
names = driver.find_elements(By.CSS_SELECTOR, 'div.storeCard')

2 Answers


  1. I would write this as a comment, but I lack the reputation to do so:
    The site locks me out as a bot, probably because my IP is clearly not from the region the page's language suggests.
    I would assume others here run into the same problem when trying to look at the page.

    Can you provide an example element with its entire source code?

    Given the aggressive security measures, I would assume that the links are obfuscated with JavaScript and might additionally be loaded later.

  2. The business details can't be read from the card itself. However, the links can be built from the data returned by the request whose URL contains business/search.

    A business link follows the pattern {url}/details/{id}, where id comes from the items array in the JSON response.

    You can read that response through the Chrome DevTools Protocol, which is now available in Selenium.

    The site also has an anti-scraping mechanism and doesn't load every time for me, so you may need a proxy, undetected Selenium, etc. I added some stealth Chrome options, but they don't always get past the bot detection (the site thinks I'm a bot even in a regular browser, so I think their bot detection is broken).

    import json
    import time
    
    from selenium import webdriver
    
    options = webdriver.ChromeOptions()
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
    
    def enable_stealth():
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-gpu")
        options.add_argument('--disable-blink-features=AutomationControlled')
        options.add_argument('--disable-dev-shm-usage')
        options.add_experimental_option("useAutomationExtension", False)
        options.add_argument("--enable-javascript")
        options.add_argument("--enable-cookies")
        options.add_argument('--disable-web-security')
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
    
    enable_stealth()
    driver = webdriver.Chrome(options=options)
    url = "https://maroof.sa/businesses"
    driver.get(url)
    # Give the page time to fire its requests before reading the performance log.
    time.sleep(5)
    logs = driver.get_log("performance")
    target_url = 'business/search'
    
    def get_links():
        for log in logs:
            message = log["message"]
            if "Network.responseReceived" not in message:
                continue
            params = json.loads(message)["message"].get("params")
            if params is None:
                continue
            response = params.get("response")
            if response is None or target_url not in response["url"]:
                continue
            body = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': params["requestId"]})
            items = json.loads(body['body'])['items']
            for item in items:
                link = f"{url}/details/{item['id']}"
                print(link)
    
    get_links()
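
The log filtering and the {url}/details/{id} pattern from the second answer can be tried without a browser. The following is only a minimal sketch: the helper names `matching_request_ids` and `build_links` are hypothetical, the sample log entries and IDs are made up, and it assumes the response body has the `items` list with `id` fields that the answer describes.

```python
import json


def matching_request_ids(log_entries, url_part):
    """Return request IDs of Network.responseReceived events whose URL contains url_part."""
    request_ids = []
    for entry in log_entries:
        # Each performance log entry carries a JSON string under "message".
        event = json.loads(entry["message"]).get("message", {})
        if event.get("method") != "Network.responseReceived":
            continue
        params = event.get("params", {})
        response_url = params.get("response", {}).get("url", "")
        if url_part in response_url:
            request_ids.append(params["requestId"])
    return request_ids


def build_links(base_url, body_text):
    """Build {base_url}/details/{id} links from a JSON body with an 'items' list."""
    items = json.loads(body_text).get("items", [])
    return [f"{base_url}/details/{item['id']}" for item in items]


# Made-up data standing in for driver.get_log("performance") and a response body.
sample_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {
            "requestId": "1.1",
            "response": {"url": "https://example.invalid/api/business/search?page=1"},
        },
    }})},
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"requestId": "1.2"},
    }})},
]
sample_body = json.dumps({"items": [{"id": 101}, {"id": 102}]})

print(matching_request_ids(sample_logs, "business/search"))
print(build_links("https://maroof.sa/businesses", sample_body))
```

In a real run, each returned request ID would be passed to `driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': ...})` as in the answer, and the `body` field of that result handed to `build_links`.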
    