Ubuntu - gathering data from clutch.co: some issues with BS4 while working on Colab

malaga
June 21, 2023
2 Answers

update: what about Selenium support in Colab? I have checked this, see below!

Good day, dear experts. At the moment I am trying to figure out a simple method to obtain data from clutch.co.

note: I work with Google Colab, and sometimes I think that some approaches are not supported on my Colab account, some of them due to Cloudflare issues.

But see this one:

import requests
from bs4 import BeautifulSoup
url = 'https://clutch.co/it-services/msp'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for l in soup.find_all('li',class_='website-link website-link-a'):
    results = (l.a.get('href'))
    links.append(results)

print(links)

This does not work either: it gives back an empty result. Do you have any idea how to solve the issue?

update: hello dear user510170, many thanks for the answer and the Selenium solution. I tried it out in Google Colab and got the following result:

--------------------------------------------------------------------------
WebDriverException                        Traceback (most recent call last)
<ipython-input-2-4f37092106f4> in <cell line: 4>()
      2 from selenium import webdriver
      3 
----> 4 driver = webdriver.Chrome()
      5 
      6 url = 'https://clutch.co/it-services/msp'

5 frames
/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    243                 alert_text = value["alert"].get("text")
    244             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 245         raise exception_class(message, screen, stacktrace)

WebDriverException: Message: unknown error: cannot find Chrome binary
Stacktrace:
#0 0x56199267a4e3 <unknown>
#1 0x5619923a9c76 <unknown>
#2 0x5619923d0757 <unknown>
#3 0x5619923cf029 <unknown>
#4 0x56199240dccc <unknown>
#5 0x56199240d47f <unknown>
#6 0x561992404de3 <unknown>
#7 0x5619923da2dd <unknown>
#8 0x5619923db34e <unknown>
#9 0x56199263a3e4 <unknown>
#10 0x56199263e3d7 <unknown>
#11 0x561992648b20 <unknown>
#12 0x56199263f023 <unknown>
#13 0x56199260d1aa <unknown>
#14 0x5619926636b8 <unknown>
#15 0x561992663847 <unknown>
#16 0x561992673243 <unknown>
#17 0x7efc5583e609 start_thread

To me it seems to have to do with line 4:

  ----> 4 driver = webdriver.Chrome()

Is it this line that needs a minor correction?
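The error message says the Chrome binary cannot be found, so one quick check before calling webdriver.Chrome() is whether any browser executable is visible on PATH at all. Below is a minimal sketch using only the standard library; the candidate names are assumptions (common executable names on Ubuntu), not an exhaustive list.

```python
import shutil

# Assumed executable names; adjust them for your own system.
CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chromedriver")


def find_browser_binaries(names=CANDIDATES):
    """Map each candidate name to its full path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_browser_binaries().items():
        print(f"{name}: {path or 'not found'}")
```

If every entry comes back as "not found", the problem is probably a missing browser installation rather than line 4 itself.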

update: thanks to tarun I got notice of this workaround:

https://medium.com/cubemail88/automatically-download-chromedriver-for-selenium-aaf2e3fd9d81

I did it; in other words, I applied it to Google Colab and tried to run the following:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the downloaded driver path through a Service object
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
browser.get("https://www.reddit.com/")
browser.quit()

Well, finally it should be able to run with this code in Colab:

import requests
from bs4 import BeautifulSoup
url = 'https://clutch.co/it-services/msp'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for l in soup.find_all('li',class_='website-link website-link-a'):
    results = (l.a.get('href'))
    links.append(results)

print(links)

update: see below for the check in Colab, and the question: is Colab generally Selenium-capable and Selenium-ready?

I look forward to hearing from you.

Thanks to @user510170, who has pointed me to another approach: How can we use Selenium Webdriver in colab.research.google.com?

Recently Google Colab was upgraded, and since Ubuntu 20.04+ no longer distributes chromium-browser outside of a snap package, you can install a compatible version from the Debian buster repository (the install script appears in the second answer below).

Then you can run selenium like this:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
# chromedriver must be on PATH; newer Selenium 4 releases no longer accept its path positionally
wd = webdriver.Chrome(options=chrome_options)
wd.get("https://www.webite-url.com")

cf. this thread: How can we use Selenium Webdriver in colab.research.google.com?

I need to try this out on Colab.

Answers

If you do print(response.content), you will see the following: "Enable JavaScript and cookies to continue". Without running JavaScript, you don’t get access to the full content. Here is a working solution based on Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

url = 'https://clutch.co/it-services/msp'

driver.get(url=url)
soup = BeautifulSoup(driver.page_source,"lxml")

links = []
for l in soup.find_all('li',class_='website-link website-link-a'):
    results = (l.a.get('href'))
    links.append(results)

print(links, "\n", "Count links - ", len(links))

Result:

...ch.co&utm_medium=referral&utm_campaign=directory', 'https://www.turrito.com/?utm_source=clutch.co&utm_medium=referral&utm_campaign=directory'] 
 Count links -  50
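The print(response.content) check from the top of this answer can be turned into a small helper. This is only a sketch: the marker strings are taken from the challenge page quoted in this thread, the function name is made up for illustration, and Cloudflare may serve different markup at other times.

```python
# Marker strings copied from the challenge page quoted in this thread.
CHALLENGE_MARKERS = (
    "Just a moment...",
    "Enable JavaScript and cookies to continue",
    "/cdn-cgi/challenge-platform/",
)


def looks_like_cloudflare_challenge(html):
    """Return True if the HTML contains any of the known challenge markers."""
    if isinstance(html, bytes):
        # requests exposes the body as bytes via response.content
        html = html.decode("utf-8", errors="replace")
    return any(marker in html for marker in CHALLENGE_MARKERS)
```

Calling looks_like_cloudflare_challenge(response.content) before parsing would explain an empty links list: the selector finds nothing because the real page never arrived.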

TL;DR

The big guns of selenium can’t shoot the Cloudflare sheriff.

The Colab link with what’s below.

All right, here’s a working selenium on Google colab that proves my point in the comment that even if you run it, you still must deal with a Cloudflare challenge.

Do the following:

Open a new colab Notebook
Run the code below:

%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

# Install chromium and chromium-driver
apt-get update
apt-get install chromium chromium-driver

# Install selenium
pip install selenium

Then run this code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")


driver = webdriver.Chrome(options=options)

driver.get("https://clutch.co/it-services/msp")
print(driver.page_source)
driver.quit()

You should see this:

<html lang="en-US" class="lang-en"><head>
    <title>Just a moment...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta name="robots" content="noindex,nofollow">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
    

<script src="/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d"></script><script src="https://challenges.cloudflare.com/turnstile/v0/b/19ad4730/api.js?onload=_cf_chl_turnstile_l&amp;render=explicit" async="" defer="" crossorigin="anonymous"></script></head>
<body class="no-js">
    <div class="main-wrapper" role="main">
    <div class="main-content">
        <h1 class="zone-name-title h1"><img src="/favicon.ico" class="heading-favicon" alt="Icon for clutch.co">clutch.co</h1><h2 id="challenge-running" class="h2">Checking if the site connection is secure</h2><div id="challenge-stage"></div><div id="challenge-spinner" class="spacer loading-spinner" style="display: block; visibility: visible;"><div class="lds-ring"><div></div><div></div><div></div><div></div></div></div><div id="challenge-body-text" class="core-msg spacer">clutch.co needs to review the security of your connection before proceeding.</div><div id="challenge-explainer-expandable" class="hidden expandable body-text spacer" style="display: none;"><div class="expandable-title" id="challenge-explainer-summary"><button class="expandable-summary-btn" id="challenge-explainer-btn" type="button">Why am I seeing this page?<span class="caret-icon-wrapper"> <div class="caret-icon"></div> </span> </button> </div> <div class="expandable-details" id="challenge-explainer-details">Requests from malicious bots can pose as legitimate traffic. 
Occasionally, you may see this page while the site ensures that the connection is secure.</div></div><div id="challenge-success" style="display: none;"><div class="h2"><span class="icon-wrapper"><img class="heading-icon" alt="Success icon" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADQAAAA0CAMAAADypuvZAAAANlBMVEUAAAAxMTEwMDAxMTExMTEwMDAwMDAwMDAxMTExMTExMTEwMDAwMDAxMTExMTEwMDAwMDAxMTHB9N+uAAAAEXRSTlMA3zDvfyBAEJC/n3BQz69gX7VMkcMAAAGySURBVEjHnZZbFoMgDEQJiDzVuv/NtgbtFGuQ4/zUKpeMIQbUhXSKE5l1XSn4pFWHRm/WShT1HRLWC01LGxFEVkCc30eYkLJ1Sjk9pvkw690VY6k8DWP9OM9yMG0Koi+mi8XA36NXmW0UXra4eJ3iwHfrfXVlgL0NqqGBHdqfeQhMmyJ48WDuKP81h3+SMPeRKkJcSXiLUK4XTHCjESOnz1VUXQoc6lgi2x4cI5aTQ201Mt8wHysI5fc05M5c81uZEtHcMKhxZ7iYEty1GfhLvGKpm+EYkdGxm1F5axmcB93DoORIbXfdN7f+hlFuyxtDP+sxtBnF43cIYwaZAWRgzxIoiXEMESoPlMhwLRDXeK772CAzXEdBRV7cmnoVBp0OSlyGidEzJTFq5hhcsA5388oSGM6b5p+qjpZrBlMS9xj4AwXmz108ukU1IomM3ceiW0CDwHCqp1NjAqXlFrbga+xuloQJ+tuyfbIBPNpqnmxqT7dPaOnZqBfhSBCteJAxWj58zLk2xgg+SPGYM6dRO6WczSnIxxwEExRaO+UyCUhbOp7CGQ+kxSUfNtLQFC+Po29vvy7jj4y0yAAAAABJRU5ErkJggg=="></span>Connection is secure</div><div class="core-msg spacer">Proceeding...</div></div><noscript>
            <div id="challenge-error-title">
                <div class="h2">
                    <span class="icon-wrapper">
                        <div class="heading-icon warning-icon"></div>
                    </span>
                    <span id="challenge-error-text">
                        Enable JavaScript and cookies to continue
                    </span>
                </div>
            </div>
        </noscript>
        <div id="trk_jschal_js" style="display:none;background-image:url('/cdn-cgi/images/trace/managed/nojs/transparent.gif?ray=7daf435aeecd112d')"></div>
        <form id="challenge-form" action="/it-services/msp?__cf_chl_f_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA" method="POST" enctype="application/x-www-form-urlencoded">
            <input type="hidden" name="md" value="Y531CK3.GDorU7Iwk2DV6cF23mTV48icLTjuAZV6568-1687382086-0-AY7Fiv3qUhkh_i93AsbTcYh_D3SG2ZegyiWzGIVG8NgRvrQkiLAuCZ_x8rfr_A4Wy5QOyAOrBLs-avkoeJD0_1G3AYtVfv9rIc6umkp5J_y75TurwQH5fCwjSC3biYbFJbdTbW_NeKfRDUQgh230Lb1UMApiygfWXkeMlzznEEKUa3EXALaHU6co68L5nf_vY6c9QyyILeTdhcspjfUkXCUIUB7ff-8QQgCKpUkZa3UH9V9Icbndie4LGMCl_QJsy5jPvIzTt7nAS_Kk1-TrPxZltr8ZyHhjdvhEyVTkyrTi46auFGmixnyt9bK5dKnGv-J59nXp3EMF34gnVnbmTQuMDG9KHaN4bR4Ij6IO94sRnDGIJnXX6aiFLHiqFx9_kh1krAg3qOuuXZ9UghjKoITy2uPx9ng7hZ73p6QILb0aW-f-GL4VBdv-f1mdZyXJYRRrlfpnGoQMy-jxy6zsZshYtI-fzuDAL3A7nU_NVEGoN7SRrS4dFdn2mGhwPwVhhzt37SQ04MMjfs-_r8KNkOVbnNBtfHp_TWwyEbrhM4Lgc-YEYVRrI-J5LVYwIv4K7JAgObKJffhs53zwB0RrFQG3pF2Qy9W8Cxq2HvlKko3clzUXmw6meZfYJPZaYIMbJa39rqF0jltNKoqOcgJa5xQSTSXrNShUO1ClAHsjUGuTA11lM8Dk5rlnS9qXVWhDWI51i-4Q7BPIkb1BqaW6K_0ltyCzXBtN8q1EqrJeno7ryMC1FyCZ2y8Hy0IsHAhNg2DAvhYov34mrEeoOc4iG4ZHZghGAPkf9tNXo5NBTVNbrwDzvwxXaMVWJRHYQ8YB6LiFK7VPWa_ZjEU7GsdWzXpa_Tp4ulnnbUGrdEThXQC3chCij4f3T7m-Pc7LZdTvs-qs2f5g6_kBwiAAro2KelOxhCsf66l5HcpHHy9uhERBx7FgItODQDqG7kR2r80QCo3kOzBqFL3CIsvtg_KYNG8HkxYqDc-YMRWsvBj5Mmt6c8RzCOkDxKC_DJwOj58CeC2o9e-6wCfgcjb0EPR8cTK_S8ht28zPLUCDJ_j119ErBnHJ1zpdJHydT1HEdnK-vaSuyYf69kOSCC7Kij4ZRttSlfiA4k9gau8QoREht_pxMwfxXraBRfYUWVXO_ZSyz561B9C4Fa1L0gW31RXgCRuzCdDg-Cgr9AN8ky06s19D3N4CZLhtGOjRfMbidHVBD9Ppe4jlcUnSx-wdkJkVXZ2S8XO4F4ou7jGhrN9l9mDIDZ98OXaL_CvhHXNBWxE1Gn1_i1_Ndb7VKFP5Y6YuPLTXaN9kS-kF3rZcIBuh_dczTVQKOEWq1QYy9_CBj2sIPSxhcuQCXwTt4K81e6UiIrovBNWiZ4VjKvLdetwmUUgnpfNbssOz5S6GieV7ENqMBdaYlIP9YPdzHdJl4WQ_stCiC_Yc0wew2XI2XvOOil8_7F1yHgCg4mPS98Y9BXNDKiLDGGl3lRs9ydBvCdiY8__KztFLuVyDiWqschUvXUOg07KBtyQDnSxOyZUn873i7Kg4dKoqAyUICRT_nhsNtGUe4wzXYk3eevEG-7Ct4tSBpw6rTrjeNqa9Lsu5b6Pv-eJX0gYpg-1pydKSKLfvQYNp9wjwT-Oh5UH8vw8lo7b3uSc6QMmkaP2jQVDnqIyQDN8cDAYu6Vdr83xiZJG1Qqn80xVe0RMwEzMcjFv7yy6QM3O-uv0tJHC8EnINpXc1uMp1zphYyIgw-xSy68x55DEf38OrsY7xbJUqdMdF_qJQPi3FOh5MYHftgyH1WyDUHrxXiVJYuTMv7DtgaLjGoA0ybDW_PcBOXI5LAXnqYYR92WmHTEghxLHKxpWqZt9t_XS4j4rycqHU261_6zPhkTklv2cUFJOOT5lRT
kY3OySP7-CEp0ZgjPrAOu4g-wt1YUprDjQzYrpmlBUXqKXzeJ795UBKn0HZLDoGQkY5_w-deyzcLV4XZXdGrxnAEOQq5Kx330hD2XgH8Q0be4WinLLZ6R8Tsl3c_5UuxLn0YJlxosFgXXLZehemg9WxGzfrOnb_5reyNr_3KU4nYWl9wFy-wsz6HtyPQ_1LnvBBgxVbrCFy-m9Wm8mt1BcaLwTUA2NSpTY0fbSwkuvx0LKTmG865H5C9qqBAgTGw2R99fv6vqq6ZP_HOzv5Q-c5L2C17lCp4cJwOCkvj7NEWQ1iCoi0X6CWZtVYFC-wXeuI4dh2D6BxGtekFuC77-Rt335ib1wPN7bf6_lA-TPb92U2IUCoq8K9frexCE7QzxaCSdKB-wRkE6g5FERuP-waii1Uquiut4aQ8tJVlwvi0nvuOvuP_Rg4P9xa2HlOxzwajrDBmzfnerhwdyEzOYQTXvwDF-ApPg5rPiMpjo29icy0K9arOF9yY_Wf2EXZD-6hjCcDswfhO3lQWFfnf1ANOFvnp0hcCvr-k93ukAnVbm8uorhSIWr2iy1JeNeGH8kM66IDkdlSLnj9igHNf6C0vDnkBoOolfXQECpmfhS6dai7Np01RQjoKsoGQU1S4rQnjsdYxBdOXqdrfYw_wfsBhV87qxHGUND6uD6m3qwU2vKCyQa_GSIGgzPfqWnhpXozyHUbmBOYJDiKI6u0x3u8mZDWhaaQWYttxUa1gQKnOQy1qM5NI8D881kXI_M2cpvX4rW9coG1k9_qE7yC--4u537ojssm9gSzNnQgeOpn-___N978hwxMqftej1jdhMJePK959TjaeJvMu045n-xtFbFGF81FIhiKtMWskbvRy1wIB3I">
        <span style="display: none;"><span class="text-gray-600" data-translate="error">error code: 1020</span></span></form>
    </div>
</div>
<script>
    (function(){
        window._cf_chl_opt={
            cvId: '2',
            cZone: 'clutch.co',
            cType: 'managed',
            cNounce: '44156',
            cRay: '7daf435aeecd112d',
            cHash: 'f86e351e5e00345',
            cUPMDTk: "/it-services/msp?__cf_chl_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA",
            cFPWv: 'b',
            cTTimeMs: '1000',
            cMTimeMs: '0',
            cTplV: 5,
            cTplB: 'cf',
            cK: "",
            cRq: {
                ru: 'aHR0cHM6Ly9jbHV0Y2guY28vaXQtc2VydmljZXMvbXNw',
                ra: 'TW96aWxsYS81LjAgKFgxMTsgTGludXggeDg2XzY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBIZWFkbGVzc0Nocm9tZS85MC4wLjQ0MzAuMjEyIFNhZmFyaS81MzcuMzY=',
                rm: 'R0VU',
                d: 'Ac34gEYVhl8DXbnILOq76p8yhzcHr06ria7SjaltDZ17DDHJrhCowkieLnLjzsxr3IgprB+0nJObDfv3tbOFZfQanW8VrnMBqy2JC8EFTBSXy7ra08EgPGOSUetaRr/bENIZ81mt06Vq52ykJX01fCO0wyHdNMat8fNwgF9RDfp7CFMpUtp0E+lofrj9tut74nR1+yniOo1zFt2zmKVpFFUunX1K1oMy8Fp1ubIQgHIBEG8g8h3CRzHD2WMTRtqYfFvCfD5PhcR+uWWgxf6ybQnii3noC7BLSbJZHZ5abVjNKZTvRGyLtkP8uNLoAQTF8A5ir68vmv+c6weSVw845TjogSfOFzHrXQvj5dnpPWEmReEsQfl2p3nJJuswyd/OUIPTMuLfPOM7EYHQKawKqI1+jp15e4QZjAl4LIhAwQoHqqcXPd9NqvBkzxrb7YhWBsvOHzgUMb5gR3exN42NVnFbUimWWdhX7Ei+tXR43I+68kGLFe4kQccvXzfYtl3G7mudbXvhkFMjAJk24bb9ugax1RyJeT1HMXZAZG7vOzGxEpf2Zgly+6twZ+C1JShkmfbHj9Z8EkYIlkxm99wVFg==',
                t: 'MTY4NzM4MjA4Ni44NzAwMDA=',
                cT: Math.floor(Date.now() / 1000),
                m: 'eRBgvpMHb6ottjHZ8LYOdoe7cvhlOKe5j2vP7BjQYIE=',
                i1: 'DrqvOBUgqLvl22W0Yoh8VA==',
                i2: 'Co7rIFnUzVj/9LmqAUCUUw==',
                zh: 'MYPZaDt93/n+i/zoik8Q5B4rNo75M88ZQHevg31AJek=',
                uh: 'U3QjejX60yUnAxm0WjPwFsHXm0FG5VD2yNoc1w8iQek=',
                hh: 'w+icDAWoSjxex064a5CZutpetBiSACwcZG4EmfuqjNI=',
            }
        };
        var trkjs = document.createElement('img');
        trkjs.setAttribute('src', '/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7daf435aeecd112d');
        trkjs.setAttribute('alt', '');
        trkjs.setAttribute('style', 'display: none');
        document.body.appendChild(trkjs);
        var cpo = document.createElement('script');
        cpo.src = '/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d';
        window._cf_chl_opt.cOgUHash = location.hash === '' && location.href.indexOf('#') !== -1 ? '#' : location.hash;
        window._cf_chl_opt.cOgUQuery = location.search === '' && location.href.slice(0, location.href.length - window._cf_chl_opt.cOgUHash.length).indexOf('?') !== -1 ? '?' : location.search;
        if (window.history && window.history.replaceState) {
            var ogU = location.pathname + window._cf_chl_opt.cOgUQuery + window._cf_chl_opt.cOgUHash;
            history.replaceState(null, null, "/it-services/msp?__cf_chl_rt_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA" + window._cf_chl_opt.cOgUHash);
            cpo.onload = function() {
                history.replaceState(null, null, ogU);
            };
        }
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
</script><img src="/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7daf435aeecd112d" alt="" style="display: none">




<div class="footer" role="contentinfo"><div class="footer-inner"><div class="clearfix diagnostic-wrapper"><div class="ray-id">Ray ID: <code>7daf435aeecd112d</code></div></div><div class="text-center" id="footer-text">Performance &amp; security by <a rel="noopener noreferrer" href="https://www.cloudflare.com?utm_source=challenge&amp;utm_campaign=m" target="_blank">Cloudflare</a></div></div></div><span id="trk_jschal_js"></span></body></html>

As you can see, running selenium doesn’t change much.
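To make that output easier to read, here is a rough sketch that pulls the Ray ID and the error code out of a challenge page like the one above. The regular expressions are assumptions based only on this single page dump, and challenge_details is a hypothetical helper name.

```python
import re


def challenge_details(page_source):
    """Extract the Ray ID and error code, if present, from a challenge page."""
    ray = re.search(r"Ray ID:\s*<code>([0-9a-f]+)</code>", page_source)
    code = re.search(r"error code:\s*(\d+)", page_source)
    return {
        "ray_id": ray.group(1) if ray else None,
        "error_code": int(code.group(1)) if code else None,
    }
```

Passing driver.page_source from the run above to this function should surface the 1020 code and the Ray ID without scrolling through the whole dump.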

So, my question to you is:

Why do you want to stick to colab so badly?

Because, running a slightly modified code locally:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.EdgeOptions()
options.add_argument("--window-size=1920,1080")
options.add_argument("--disable-gpu")
options.add_argument("--disable-extensions")
browser = webdriver.Edge(options=options)

browser.get("https://clutch.co/it-services/msp")

print("Waiting for download links to appear...")
WebDriverWait(browser, 5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".infobar__counter"))
)

css_selector = ".directory-list div.provider-info--header .company_info a"
download_links = [
    link.get_attribute("href") for link
    in browser.find_elements(By.CSS_SELECTOR, css_selector)
]

print(download_links)

Should open a browser window (this time it's Edge) and return this:

['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor', 'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions', 'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions', 'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision', 'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft', 'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp', 'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis', 'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology', 'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies', 'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts', 'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd', 'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech', 'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants', 'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet', 'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo', 'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity', 'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0', 'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications', 'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich', 'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions', 'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium', 'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare', 'https://clutch.co/profile/innowise-group', 
'https://clutch.co/profile/tech-superpowers-0', 'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology', 'https://clutch.co/profile/turrito-networks']

On the other hand, locally, you don’t even need selenium if you have cloudscraper.

For example, this:

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()
source = scraper.get("https://clutch.co/it-services/msp")

css_selector = ".directory-list div.provider-info--header .company_info a"

links = [
    f'https://clutch.co{anchor["href"]}' for anchor in
    BeautifulSoup(source.text, "html.parser").select(css_selector)
]
print(links)

Should return:

['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor', 'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions', 'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions', 'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision', 'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft', 'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp', 'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis', 'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology', 'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies', 'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts', 'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd', 'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech', 'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants', 'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet', 'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo', 'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity', 'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0', 'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications', 'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich', 'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions', 'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium', 'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare', 'https://clutch.co/profile/innowise-group', 
'https://clutch.co/profile/tech-superpowers-0', 'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology', 'https://clutch.co/profile/turrito-networks']
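The f-string above assumes every href is a site-relative path. A slightly more defensive sketch uses urljoin from the standard library, which also leaves already-absolute links untouched; absolute_links is a hypothetical helper name, not part of cloudscraper or BeautifulSoup.

```python
from urllib.parse import urljoin

BASE_URL = "https://clutch.co"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve each href against the base; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


print(absolute_links(["/profile/empist", "https://clutch.co/profile/deft"]))
```

With the scraper above, the hrefs would come from [a["href"] for a in soup.select(css_selector)].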

PS. The source for the Debian magic on colab is here.
