
Update: what about Selenium support in Colab? I have checked this, see below.

Good day, dear experts. At the moment I am trying to figure out a simple way to obtain data from clutch.co.

Note: I work with Google Colab, and I suspect that some approaches are not supported on my Colab account, partly because of Cloudflare-related issues.

But see this attempt:

import requests
from bs4 import BeautifulSoup
url = 'https://clutch.co/it-services/msp'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for l in soup.find_all('li',class_='website-link website-link-a'):
    results = (l.a.get('href'))
    links.append(results)

print(links)

This does not work either: it returns an empty result. Do you have any idea how to solve the issue?

Update: hello dear user510170, many thanks for the answer and the Selenium solution. I tried it out in Google Colab and got the following result:

--------------------------------------------------------------------------
WebDriverException                        Traceback (most recent call last)
<ipython-input-2-4f37092106f4> in <cell line: 4>()
      2 from selenium import webdriver
      3 
----> 4 driver = webdriver.Chrome()
      5 
      6 url = 'https://clutch.co/it-services/msp'

5 frames
/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    243                 alert_text = value["alert"].get("text")
    244             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 245         raise exception_class(message, screen, stacktrace)

WebDriverException: Message: unknown error: cannot find Chrome binary
Stacktrace:
#0 0x56199267a4e3 <unknown>
#1 0x5619923a9c76 <unknown>
#2 0x5619923d0757 <unknown>
#3 0x5619923cf029 <unknown>
#4 0x56199240dccc <unknown>
#5 0x56199240d47f <unknown>
#6 0x561992404de3 <unknown>
#7 0x5619923da2dd <unknown>
#8 0x5619923db34e <unknown>
#9 0x56199263a3e4 <unknown>
#10 0x56199263e3d7 <unknown>
#11 0x561992648b20 <unknown>
#12 0x56199263f023 <unknown>
#13 0x56199260d1aa <unknown>
#14 0x5619926636b8 <unknown>
#15 0x561992663847 <unknown>
#16 0x561992673243 <unknown>
#17 0x7efc5583e609 start_thread

To me it seems to be related to line 4:

  ----> 4 driver = webdriver.Chrome()

Is this the line that needs a minor correction?
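
The message "cannot find Chrome binary" suggests that no browser executable could be located, so the fix is probably on the environment side rather than in line 4 itself. A minimal sketch to check which executables are on the PATH; the candidate names are assumptions about common package names, and `find_executables` is a hypothetical helper, not part of Selenium:

```python
import shutil

# Assumed candidate names; adjust to whatever your system actually installs.
CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chromedriver")


def find_executables(names=CANDIDATES):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_executables().items():
        print(f"{name}: {path or 'not found'}")
```

If every browser entry comes back as "not found", `webdriver.Chrome()` has nothing to launch, whatever arguments it is given.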

Update: thanks to tarun I learned about this workaround:

https://medium.com/cubemail88/automatically-download-chromedriver-for-selenium-aaf2e3fd9d81

I applied it in Google Colab and tried to run the following:

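from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Recent Selenium 4 releases expect the driver path wrapped in a Service object
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
browser.get("https://www.reddit.com/")
browser.quit()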

Well, in the end this code should be able to run in Colab:

import requests
from bs4 import BeautifulSoup
url = 'https://clutch.co/it-services/msp'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for l in soup.find_all('li',class_='website-link website-link-a'):
    results = (l.a.get('href'))
    links.append(results)

print(links)

Update: see below for the check in Colab. The question remains: is Colab generally Selenium-capable and Selenium-ready?


I look forward to hearing from you.

Thanks to @user510170, who pointed me to another approach: How can we use Selenium Webdriver in colab.research.google.com?

Recently Google Colab was upgraded, and since Ubuntu 20.04+ no longer distributes chromium-browser outside of a snap package, you can install a compatible version from the Debian buster repository:

Then you can run selenium like this:

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # Colab has no display
chrome_options.add_argument('--no-sandbox')  # commonly needed when running as root
# Pass options by keyword; recent Selenium 4 releases no longer accept the driver path positionally
wd = webdriver.Chrome(options=chrome_options)
wd.get("https://www.webite-url.com")

See this thread: How can we use Selenium Webdriver in colab.research.google.com?

I still need to try this out on Colab.

2 Answers


  1. If you do print(response.content), you will see the following message: "Enable JavaScript and cookies to continue". Without JavaScript, you don’t get access to the full content. Here is a working solution based on Selenium.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    
    url = 'https://clutch.co/it-services/msp'
    
    driver.get(url=url)
    soup = BeautifulSoup(driver.page_source,"lxml")
    
    links = []
    for l in soup.find_all('li',class_='website-link website-link-a'):
        results = (l.a.get('href'))
        links.append(results)
    
    print(links, "\n", "Count links - ", len(links))
    

    Result:

    ...ch.co&utm_medium=referral&utm_campaign=directory', 'https://www.turrito.com/?utm_source=clutch.co&utm_medium=referral&utm_campaign=directory'] 
     Count links -  50
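
The observation above can be turned into a quick check before parsing. A minimal sketch, assuming the challenge page keeps the two phrases quoted in this thread ("Just a moment..." and "Enable JavaScript and cookies to continue"); `looks_like_challenge` is a hypothetical helper, and the markers may change at any time:

```python
# Phrases taken from the challenge page shown in this thread; Cloudflare may change them.
CHALLENGE_MARKERS = (
    "Just a moment...",
    "Enable JavaScript and cookies to continue",
)


def looks_like_challenge(html: str) -> bool:
    """Return True if the HTML contains one of the known challenge phrases."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    blocked = "<html><head><title>Just a moment...</title></head></html>"
    normal = '<li class="website-link website-link-a"><a href="https://example.com">x</a></li>'
    print(looks_like_challenge(blocked))
    print(looks_like_challenge(normal))
```

With requests, you would call `looks_like_challenge(response.text)` and skip the BeautifulSoup step when it returns True, instead of silently ending up with an empty list.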
    
  2. TL;DR

    The big guns of selenium can’t shoot the Cloudflare sheriff.

    The Colab link with what’s below.


    All right, here’s a working Selenium setup on Google Colab that proves the point from my comment: even if you get it running, you still have to deal with a Cloudflare challenge.

    Do the following:

    • Open a new colab Notebook
    • Run the code below:
    %%shell
    # Ubuntu no longer distributes chromium-browser outside of snap
    #
    # Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap
    
    # Add debian buster
    cat > /etc/apt/sources.list.d/debian.list <<'EOF'
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
    deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
    EOF
    
    # Add keys
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
    
    apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
    apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
    apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg
    
    # Prefer debian repo for chromium* packages only
    # Note the double-blank lines between entries
    cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
    Package: *
    Pin: release a=eoan
    Pin-Priority: 500
    
    
    Package: *
    Pin: origin "deb.debian.org"
    Pin-Priority: 300
    
    
    Package: chromium*
    Pin: origin "deb.debian.org"
    Pin-Priority: 700
    EOF
    
    # Install chromium and chromium-driver
    apt-get update
    apt-get install -y chromium chromium-driver
    
    # Install selenium
    pip install selenium
    
    • Then run this code:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    
    
    driver = webdriver.Chrome(options=options)
    
    driver.get("https://clutch.co/it-services/msp")
    print(driver.page_source)
    driver.quit()
    

    You should see this:

    <html lang="en-US" class="lang-en"><head>
        <title>Just a moment...</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <meta name="robots" content="noindex,nofollow">
        <meta name="viewport" content="width=device-width,initial-scale=1">
        <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
        
    
    <script src="/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d"></script><script src="https://challenges.cloudflare.com/turnstile/v0/b/19ad4730/api.js?onload=_cf_chl_turnstile_l&amp;render=explicit" async="" defer="" crossorigin="anonymous"></script></head>
    <body class="no-js">
        <div class="main-wrapper" role="main">
        <div class="main-content">
            <h1 class="zone-name-title h1"><img src="/favicon.ico" class="heading-favicon" alt="Icon for clutch.co">clutch.co</h1><h2 id="challenge-running" class="h2">Checking if the site connection is secure</h2><div id="challenge-stage"></div><div id="challenge-spinner" class="spacer loading-spinner" style="display: block; visibility: visible;"><div class="lds-ring"><div></div><div></div><div></div><div></div></div></div><div id="challenge-body-text" class="core-msg spacer">clutch.co needs to review the security of your connection before proceeding.</div><div id="challenge-explainer-expandable" class="hidden expandable body-text spacer" style="display: none;"><div class="expandable-title" id="challenge-explainer-summary"><button class="expandable-summary-btn" id="challenge-explainer-btn" type="button">Why am I seeing this page?<span class="caret-icon-wrapper"> <div class="caret-icon"></div> </span> </button> </div> <div class="expandable-details" id="challenge-explainer-details">Requests from malicious bots can pose as legitimate traffic. 
Occasionally, you may see this page while the site ensures that the connection is secure.</div></div><div id="challenge-success" style="display: none;"><div class="h2"><span class="icon-wrapper"><img class="heading-icon" alt="Success icon" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADQAAAA0CAMAAADypuvZAAAANlBMVEUAAAAxMTEwMDAxMTExMTEwMDAwMDAwMDAxMTExMTExMTEwMDAwMDAxMTExMTEwMDAwMDAxMTHB9N+uAAAAEXRSTlMA3zDvfyBAEJC/n3BQz69gX7VMkcMAAAGySURBVEjHnZZbFoMgDEQJiDzVuv/NtgbtFGuQ4/zUKpeMIQbUhXSKE5l1XSn4pFWHRm/WShT1HRLWC01LGxFEVkCc30eYkLJ1Sjk9pvkw690VY6k8DWP9OM9yMG0Koi+mi8XA36NXmW0UXra4eJ3iwHfrfXVlgL0NqqGBHdqfeQhMmyJ48WDuKP81h3+SMPeRKkJcSXiLUK4XTHCjESOnz1VUXQoc6lgi2x4cI5aTQ201Mt8wHysI5fc05M5c81uZEtHcMKhxZ7iYEty1GfhLvGKpm+EYkdGxm1F5axmcB93DoORIbXfdN7f+hlFuyxtDP+sxtBnF43cIYwaZAWRgzxIoiXEMESoPlMhwLRDXeK772CAzXEdBRV7cmnoVBp0OSlyGidEzJTFq5hhcsA5388oSGM6b5p+qjpZrBlMS9xj4AwXmz108ukU1IomM3ceiW0CDwHCqp1NjAqXlFrbga+xuloQJ+tuyfbIBPNpqnmxqT7dPaOnZqBfhSBCteJAxWj58zLk2xgg+SPGYM6dRO6WczSnIxxwEExRaO+UyCUhbOp7CGQ+kxSUfNtLQFC+Po29vvy7jj4y0yAAAAABJRU5ErkJggg=="></span>Connection is secure</div><div class="core-msg spacer">Proceeding...</div></div><noscript>
                <div id="challenge-error-title">
                    <div class="h2">
                        <span class="icon-wrapper">
                            <div class="heading-icon warning-icon"></div>
                        </span>
                        <span id="challenge-error-text">
                            Enable JavaScript and cookies to continue
                        </span>
                    </div>
                </div>
            </noscript>
            <div id="trk_jschal_js" style="display:none;background-image:url('/cdn-cgi/images/trace/managed/nojs/transparent.gif?ray=7daf435aeecd112d')"></div>
            <form id="challenge-form" action="/it-services/msp?__cf_chl_f_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA" method="POST" enctype="application/x-www-form-urlencoded">
                <input type="hidden" name="md" value="Y531CK3.GDorU7Iwk2DV6cF23mTV48icLTjuAZV6568-1687382086-0-AY7Fiv3qUhkh_i93AsbTcYh_D3SG2ZegyiWzGIVG8NgRvrQkiLAuCZ_x8rfr_A4Wy5QOyAOrBLs-avkoeJD0_1G3AYtVfv9rIc6umkp5J_y75TurwQH5fCwjSC3biYbFJbdTbW_NeKfRDUQgh230Lb1UMApiygfWXkeMlzznEEKUa3EXALaHU6co68L5nf_vY6c9QyyILeTdhcspjfUkXCUIUB7ff-8QQgCKpUkZa3UH9V9Icbndie4LGMCl_QJsy5jPvIzTt7nAS_Kk1-TrPxZltr8ZyHhjdvhEyVTkyrTi46auFGmixnyt9bK5dKnGv-J59nXp3EMF34gnVnbmTQuMDG9KHaN4bR4Ij6IO94sRnDGIJnXX6aiFLHiqFx9_kh1krAg3qOuuXZ9UghjKoITy2uPx9ng7hZ73p6QILb0aW-f-GL4VBdv-f1mdZyXJYRRrlfpnGoQMy-jxy6zsZshYtI-fzuDAL3A7nU_NVEGoN7SRrS4dFdn2mGhwPwVhhzt37SQ04MMjfs-_r8KNkOVbnNBtfHp_TWwyEbrhM4Lgc-YEYVRrI-J5LVYwIv4K7JAgObKJffhs53zwB0RrFQG3pF2Qy9W8Cxq2HvlKko3clzUXmw6meZfYJPZaYIMbJa39rqF0jltNKoqOcgJa5xQSTSXrNShUO1ClAHsjUGuTA11lM8Dk5rlnS9qXVWhDWI51i-4Q7BPIkb1BqaW6K_0ltyCzXBtN8q1EqrJeno7ryMC1FyCZ2y8Hy0IsHAhNg2DAvhYov34mrEeoOc4iG4ZHZghGAPkf9tNXo5NBTVNbrwDzvwxXaMVWJRHYQ8YB6LiFK7VPWa_ZjEU7GsdWzXpa_Tp4ulnnbUGrdEThXQC3chCij4f3T7m-Pc7LZdTvs-qs2f5g6_kBwiAAro2KelOxhCsf66l5HcpHHy9uhERBx7FgItODQDqG7kR2r80QCo3kOzBqFL3CIsvtg_KYNG8HkxYqDc-YMRWsvBj5Mmt6c8RzCOkDxKC_DJwOj58CeC2o9e-6wCfgcjb0EPR8cTK_S8ht28zPLUCDJ_j119ErBnHJ1zpdJHydT1HEdnK-vaSuyYf69kOSCC7Kij4ZRttSlfiA4k9gau8QoREht_pxMwfxXraBRfYUWVXO_ZSyz561B9C4Fa1L0gW31RXgCRuzCdDg-Cgr9AN8ky06s19D3N4CZLhtGOjRfMbidHVBD9Ppe4jlcUnSx-wdkJkVXZ2S8XO4F4ou7jGhrN9l9mDIDZ98OXaL_CvhHXNBWxE1Gn1_i1_Ndb7VKFP5Y6YuPLTXaN9kS-kF3rZcIBuh_dczTVQKOEWq1QYy9_CBj2sIPSxhcuQCXwTt4K81e6UiIrovBNWiZ4VjKvLdetwmUUgnpfNbssOz5S6GieV7ENqMBdaYlIP9YPdzHdJl4WQ_stCiC_Yc0wew2XI2XvOOil8_7F1yHgCg4mPS98Y9BXNDKiLDGGl3lRs9ydBvCdiY8__KztFLuVyDiWqschUvXUOg07KBtyQDnSxOyZUn873i7Kg4dKoqAyUICRT_nhsNtGUe4wzXYk3eevEG-7Ct4tSBpw6rTrjeNqa9Lsu5b6Pv-eJX0gYpg-1pydKSKLfvQYNp9wjwT-Oh5UH8vw8lo7b3uSc6QMmkaP2jQVDnqIyQDN8cDAYu6Vdr83xiZJG1Qqn80xVe0RMwEzMcjFv7yy6QM3O-uv0tJHC8EnINpXc1uMp1zphYyIgw-xSy68x55DEf38OrsY7xbJUqdMdF_qJQPi3FOh5MYHftgyH1WyDUHrxXiVJYuTMv7DtgaLjGoA0ybDW_PcBOXI5LAXnqYYR92WmHTEghxLHKxpWqZt9t_XS4j4rycqHU261_6zPhkTklv2cUFJOOT
5lRTkY3OySP7-CEp0ZgjPrAOu4g-wt1YUprDjQzYrpmlBUXqKXzeJ795UBKn0HZLDoGQkY5_w-deyzcLV4XZXdGrxnAEOQq5Kx330hD2XgH8Q0be4WinLLZ6R8Tsl3c_5UuxLn0YJlxosFgXXLZehemg9WxGzfrOnb_5reyNr_3KU4nYWl9wFy-wsz6HtyPQ_1LnvBBgxVbrCFy-m9Wm8mt1BcaLwTUA2NSpTY0fbSwkuvx0LKTmG865H5C9qqBAgTGw2R99fv6vqq6ZP_HOzv5Q-c5L2C17lCp4cJwOCkvj7NEWQ1iCoi0X6CWZtVYFC-wXeuI4dh2D6BxGtekFuC77-Rt335ib1wPN7bf6_lA-TPb92U2IUCoq8K9frexCE7QzxaCSdKB-wRkE6g5FERuP-waii1Uquiut4aQ8tJVlwvi0nvuOvuP_Rg4P9xa2HlOxzwajrDBmzfnerhwdyEzOYQTXvwDF-ApPg5rPiMpjo29icy0K9arOF9yY_Wf2EXZD-6hjCcDswfhO3lQWFfnf1ANOFvnp0hcCvr-k93ukAnVbm8uorhSIWr2iy1JeNeGH8kM66IDkdlSLnj9igHNf6C0vDnkBoOolfXQECpmfhS6dai7Np01RQjoKsoGQU1S4rQnjsdYxBdOXqdrfYw_wfsBhV87qxHGUND6uD6m3qwU2vKCyQa_GSIGgzPfqWnhpXozyHUbmBOYJDiKI6u0x3u8mZDWhaaQWYttxUa1gQKnOQy1qM5NI8D881kXI_M2cpvX4rW9coG1k9_qE7yC--4u537ojssm9gSzNnQgeOpn-___N978hwxMqftej1jdhMJePK959TjaeJvMu045n-xtFbFGF81FIhiKtMWskbvRy1wIB3I">
            <span style="display: none;"><span class="text-gray-600" data-translate="error">error code: 1020</span></span></form>
        </div>
    </div>
    <script>
        (function(){
            window._cf_chl_opt={
                cvId: '2',
                cZone: 'clutch.co',
                cType: 'managed',
                cNounce: '44156',
                cRay: '7daf435aeecd112d',
                cHash: 'f86e351e5e00345',
                cUPMDTk: "/it-services/msp?__cf_chl_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA",
                cFPWv: 'b',
                cTTimeMs: '1000',
                cMTimeMs: '0',
                cTplV: 5,
                cTplB: 'cf',
                cK: "",
                cRq: {
                    ru: 'aHR0cHM6Ly9jbHV0Y2guY28vaXQtc2VydmljZXMvbXNw',
                    ra: 'TW96aWxsYS81LjAgKFgxMTsgTGludXggeDg2XzY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBIZWFkbGVzc0Nocm9tZS85MC4wLjQ0MzAuMjEyIFNhZmFyaS81MzcuMzY=',
                    rm: 'R0VU',
                    d: 'Ac34gEYVhl8DXbnILOq76p8yhzcHr06ria7SjaltDZ17DDHJrhCowkieLnLjzsxr3IgprB+0nJObDfv3tbOFZfQanW8VrnMBqy2JC8EFTBSXy7ra08EgPGOSUetaRr/bENIZ81mt06Vq52ykJX01fCO0wyHdNMat8fNwgF9RDfp7CFMpUtp0E+lofrj9tut74nR1+yniOo1zFt2zmKVpFFUunX1K1oMy8Fp1ubIQgHIBEG8g8h3CRzHD2WMTRtqYfFvCfD5PhcR+uWWgxf6ybQnii3noC7BLSbJZHZ5abVjNKZTvRGyLtkP8uNLoAQTF8A5ir68vmv+c6weSVw845TjogSfOFzHrXQvj5dnpPWEmReEsQfl2p3nJJuswyd/OUIPTMuLfPOM7EYHQKawKqI1+jp15e4QZjAl4LIhAwQoHqqcXPd9NqvBkzxrb7YhWBsvOHzgUMb5gR3exN42NVnFbUimWWdhX7Ei+tXR43I+68kGLFe4kQccvXzfYtl3G7mudbXvhkFMjAJk24bb9ugax1RyJeT1HMXZAZG7vOzGxEpf2Zgly+6twZ+C1JShkmfbHj9Z8EkYIlkxm99wVFg==',
                    t: 'MTY4NzM4MjA4Ni44NzAwMDA=',
                    cT: Math.floor(Date.now() / 1000),
                    m: 'eRBgvpMHb6ottjHZ8LYOdoe7cvhlOKe5j2vP7BjQYIE=',
                    i1: 'DrqvOBUgqLvl22W0Yoh8VA==',
                    i2: 'Co7rIFnUzVj/9LmqAUCUUw==',
                    zh: 'MYPZaDt93/n+i/zoik8Q5B4rNo75M88ZQHevg31AJek=',
                    uh: 'U3QjejX60yUnAxm0WjPwFsHXm0FG5VD2yNoc1w8iQek=',
                    hh: 'w+icDAWoSjxex064a5CZutpetBiSACwcZG4EmfuqjNI=',
                }
            };
            var trkjs = document.createElement('img');
            trkjs.setAttribute('src', '/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7daf435aeecd112d');
            trkjs.setAttribute('alt', '');
            trkjs.setAttribute('style', 'display: none');
            document.body.appendChild(trkjs);
            var cpo = document.createElement('script');
            cpo.src = '/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d';
            window._cf_chl_opt.cOgUHash = location.hash === '' && location.href.indexOf('#') !== -1 ? '#' : location.hash;
            window._cf_chl_opt.cOgUQuery = location.search === '' && location.href.slice(0, location.href.length - window._cf_chl_opt.cOgUHash.length).indexOf('?') !== -1 ? '?' : location.search;
            if (window.history && window.history.replaceState) {
                var ogU = location.pathname + window._cf_chl_opt.cOgUQuery + window._cf_chl_opt.cOgUHash;
                history.replaceState(null, null, "/it-services/msp?__cf_chl_rt_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA" + window._cf_chl_opt.cOgUHash);
                cpo.onload = function() {
                    history.replaceState(null, null, ogU);
                };
            }
            document.getElementsByTagName('head')[0].appendChild(cpo);
        }());
    </script><img src="/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7daf435aeecd112d" alt="" style="display: none">
    
    
    
    
    <div class="footer" role="contentinfo"><div class="footer-inner"><div class="clearfix diagnostic-wrapper"><div class="ray-id">Ray ID: <code>7daf435aeecd112d</code></div></div><div class="text-center" id="footer-text">Performance &amp; security by <a rel="noopener noreferrer" href="https://www.cloudflare.com?utm_source=challenge&amp;utm_campaign=m" target="_blank">Cloudflare</a></div></div></div><span id="trk_jschal_js"></span></body></html>
    

    As you can see, running selenium doesn’t change much.
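
Some fields in the `cRq` block of the page above are plain Base64, and decoding them hints at what Cloudflare saw. A minimal sketch; reading `ru`, `ra` and `rm` as request URL, user agent and request method is an assumption based on the decoded values, not documented behaviour:

```python
import base64

# Values copied verbatim from the cRq block in the challenge page above.
FIELDS = {
    "ru": "aHR0cHM6Ly9jbHV0Y2guY28vaXQtc2VydmljZXMvbXNw",
    "ra": "TW96aWxsYS81LjAgKFgxMTsgTGludXggeDg2XzY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBIZWFkbGVzc0Nocm9tZS85MC4wLjQ0MzAuMjEyIFNhZmFyaS81MzcuMzY=",
    "rm": "R0VU",
}


def decode_fields(fields):
    """Base64-decode every value and return the results as text."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in fields.items()}


if __name__ == "__main__":
    for key, value in decode_fields(FIELDS).items():
        print(f"{key}: {value}")
```

The decoded `ra` value contains `HeadlessChrome`, so the headless browser announces itself in its user agent. That may be one of the signals involved, though nothing in this thread confirms which check actually triggers the challenge.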


    So, my question to you is:

    Why do you want to stick to colab so badly?

    Because running slightly modified code locally:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    
    options = webdriver.EdgeOptions()
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-extensions")
    browser = webdriver.Edge(options=options)
    
    browser.get("https://clutch.co/it-services/msp")
    
    print("Waiting for download links to appear...")
    WebDriverWait(browser, 5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".infobar__counter"))
    )
    
    css_selector = ".directory-list div.provider-info--header .company_info a"
    download_links = [
        link.get_attribute("href") for link
        in browser.find_elements(By.CSS_SELECTOR, css_selector)
    ]
    
    print(download_links)
    

    Should open a browser window (Edge this time) and return this:

    ['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor', 'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions', 'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions', 'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision', 'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft', 'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp', 'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis', 'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology', 'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies', 'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts', 'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd', 'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech', 'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants', 'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet', 'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo', 'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity', 'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0', 'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications', 'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich', 'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions', 'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium', 'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare', 'https://clutch.co/profile/innowise-group', 
'https://clutch.co/profile/tech-superpowers-0', 'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology', 'https://clutch.co/profile/turrito-networks']
    

    On the other hand, locally, you don’t even need selenium if you have cloudscraper.

    For example, this:

    import cloudscraper
    from bs4 import BeautifulSoup
    
    scraper = cloudscraper.create_scraper()
    source = scraper.get("https://clutch.co/it-services/msp")
    
    css_selector = ".directory-list div.provider-info--header .company_info a"
    
    links = [
        f'https://clutch.co{anchor["href"]}' for anchor in
        BeautifulSoup(source.text, "html.parser").select(css_selector)
    ]
    print(links)
    

    Should return:

    ['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor', 'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions', 'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions', 'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision', 'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft', 'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp', 'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis', 'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology', 'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies', 'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts', 'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd', 'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech', 'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants', 'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet', 'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo', 'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity', 'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0', 'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications', 'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich', 'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions', 'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium', 'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare', 'https://clutch.co/profile/innowise-group', 
'https://clutch.co/profile/tech-superpowers-0', 'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology', 'https://clutch.co/profile/turrito-networks']
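
The f-string in the cloudscraper example assumes every `href` is site-relative. A minimal sketch of a more forgiving variant using the standard library's `urljoin`; `absolutize` is a hypothetical helper and the sample hrefs are only illustrative:

```python
from urllib.parse import urljoin

BASE_URL = "https://clutch.co"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against base; already absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


if __name__ == "__main__":
    sample = ["/profile/empist", "https://clutch.co/profile/sugarshot"]
    print(absolutize(sample))
```

This keeps the result correct if the site ever switches between relative and absolute links in the listing.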
    

    PS. The source for the Debian magic on colab is here.
