Html - extract hidden links from web page

mohamedsultan
December 31, 2023
2 Answers

Please check this link: https://maroof.sa/businesses.

It is the website I want to extract links from.

For example, if you scroll down you will find a store card named "Marwa store".
Clicking the card redirects you to the store's page.

Now I need to scrape the links of all the stores on the page https://maroof.sa/businesses.

After inspecting the page, I found that the links are hidden.

I have successfully extracted the store names,
but I can't find the links.

Thanks in advance.

import csv
import time

from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome()
driver.get(url="https://maroof.sa/businesses")
html = driver.page_source
# Each store is rendered as a card; this finds the cards (and their names) but no link.
names = driver.find_elements(By.CSS_SELECTOR, 'div.storeCard')

Answers

- DanielWallenborn
- December 31, 2023 at 2:53 pm
- 0 votes
I would write this as a comment, but I lack the reputation to do so:
The site locks me out as a bot, probably because my IP address is clearly not from the region that the site's language suggests.
So I assume others here will run into the same problem when trying to look at the page.

Can you provide an example object, or its entire source code?

Given the aggressive security measures, I would assume that the links are obfuscated with JavaScript and might additionally be loaded later.


The store link cannot be obtained from the card markup itself; however, it can be built from the data returned by the request whose URL contains business/search.

Each business link follows the pattern {url}/details/{id}, where id comes from the entries of the items list in the JSON response.
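As a minimal sketch of that pattern, the helper below turns a response body into store links. The function name build_store_links and the sample payload are made up for illustration; the only parts taken from the site are the items list and the id field, and real responses will carry more fields per item.

```python
import json


def build_store_links(base_url, response_text):
    """Return one details URL per entry in the response's "items" list."""
    payload = json.loads(response_text)
    base = base_url.rstrip("/")
    return [f"{base}/details/{item['id']}" for item in payload.get("items", [])]


# Hypothetical sample payload, reduced to the one field the pattern needs.
sample = json.dumps({"items": [{"id": 101}, {"id": 202}]})
links = build_store_links("https://maroof.sa/businesses", sample)
for link in links:
    print(link)
```

Keeping the link building separate from the browser code makes it easy to check against a saved response before running Selenium at all.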

You can capture that response using the Chrome DevTools Protocol (CDP), which is now available in Selenium.

The site also has an anti-scraping mechanism and doesn't load every time for me, so you may need to use a proxy, Undetected Selenium, or similar. I added some stealth Chrome options, but they don't always get past the bot detection (the site thinks I'm a bot even in a regular browser, so I think their bot detection is broken).
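Because the page does not load reliably, one option is to repeat the attempt a few times before giving up. The sketch below is a generic retry helper, not something the site or Selenium provides; retry, flaky_load and the sample link are hypothetical names and data used only to show the idea.

```python
import time


def retry(action, attempts=3, delay=2.0):
    """Call action() until it returns a truthy value or the attempts run out."""
    for attempt in range(1, attempts + 1):
        result = action()
        if result:
            return result
        if attempt < attempts:
            time.sleep(delay)  # wait before the next try
    return None


# Stand-in for "load the page and collect links": fails twice, then succeeds.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        return []
    return ["https://maroof.sa/businesses/details/101"]


links = retry(flaky_load, attempts=5, delay=0)
print(calls["count"], links)
```

With Selenium, the action would reload the page and return whatever links it managed to collect, so an empty result triggers another try.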

import json
import time

from selenium import webdriver

options = webdriver.ChromeOptions()
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

def enable_stealth():
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--disable-dev-shm-usage')
    options.add_experimental_option("useAutomationExtension", False)
    options.add_argument("--enable-javascript")
    options.add_argument("--enable-cookies")
    options.add_argument('--disable-web-security')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])

enable_stealth()
driver = webdriver.Chrome(options)
url = "https://maroof.sa/businesses"
driver.get(url)
logs = driver.get_log("performance")
time.sleep(5)
target_url = 'business/search'

def get_links():
    for log in logs:
        message = log["message"]
        # Cheap substring check before parsing the JSON payload.
        if "Network.responseReceived" not in message:
            continue
        params = json.loads(message)["message"].get("params")
        if params is None:
            continue
        response = params.get("response")
        if response is None or target_url not in response["url"]:
            continue
        # Fetch the body of the matching response through CDP.
        body = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': params["requestId"]})
        items = json.loads(body['body'])['items']
        for item in items:
            # Store page pattern: {url}/details/{id}
            link = f"{url}/details/{item['id']}"
            print(link)

get_links()
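The log filtering in the code above can also be written as a pure function, which allows it to be checked without a browser. This is a sketch under assumptions: matching_request_ids and fake_entry are hypothetical helpers, and the sample entries only imitate the fields the code above reads (message, method, params, requestId, response.url).

```python
import json


def matching_request_ids(log_entries, url_part):
    """Return the requestId of every Network.responseReceived event whose URL contains url_part."""
    request_ids = []
    for entry in log_entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        params = event.get("params") or {}
        response = params.get("response") or {}
        if url_part in response.get("url", ""):
            request_ids.append(params["requestId"])
    return request_ids


def fake_entry(method, url, request_id):
    """Build a log entry shaped like the ones read above (hypothetical data)."""
    event = {"method": method, "params": {"requestId": request_id, "response": {"url": url}}}
    return {"message": json.dumps({"message": event})}


sample_logs = [
    fake_entry("Network.responseReceived", "https://example.invalid/static/app.js", "1"),
    fake_entry("Network.responseReceived", "https://example.invalid/api/business/search?page=1", "2"),
    fake_entry("Network.requestWillBeSent", "https://example.invalid/api/business/search?page=2", "3"),
]
found = matching_request_ids(sample_logs, "business/search")
print(found)
```

Each returned id can then be passed to driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': ...}) as in the code above.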
