For some reason this clutch.co scraper only works properly when I run it on one of the sites:
- a. https://clutch.co/us/web-developers (the US category): it works great
- b. https://clutch.co/il/web-developers (the Israel category): it does not work
When I run this code it only gets information from the first page and then closes itself. I added waits to allow the page to load, but that hasn't helped. Watching the browser, you can see it scroll to the bottom of the page and then close itself.
Well, this runs for me (see below), but only for the US site, not for others such as the Israel site: a. https://clutch.co/us/web-developers runs great.
b. https://clutch.co/il/web-developers stops and returns a whole lot of errors.
It seems there can be an issue with locating the elements with the class name 'provider-info'. I guess this could be due to changes in the structure of the clutch.co site, or else to timing issues. I think handling of potential exceptions should be added. This version works for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time
website = "https://clutch.co/us/web-developers"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)
wait = WebDriverWait(driver, 10)
# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except NoSuchElementException:
        # No "next" link found, so we are on the last page
        return False
company_names = []
taglines = []
locations = []
costs = []
ratings = []
current_page = 1
last_page = 250
while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break
    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)
            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)
            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)
            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)
            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue
    current_page += 1
    if not navigate_to_next_page():
        break
driver.close()
data = {'Company_Name': company_names, 'Tagline': taglines, 'location': locations, 'Ticket_Price': costs, 'Rating': ratings}
df = pd.DataFrame(data)
df.to_csv('companies_test1.csv', index=False)
print(df)
which gives back the following
Timeout Exception occurred while waiting for company elements.
Company_Name ... Rating
0 Hyperlink InfoSystem ... 4.9
1 Plego Technologies ... 5.0
2 Azuro Digital ... 4.9
3 Savas Labs ... 5.0
4 The Gnar Company ... 4.8
5 Sunrise Integration ... 5.0
6 Baytech Consulting ... 5.0
7 Inventive Works ... 4.9
8 Utility ... 4.8
9 Busy Human ... 5.0
10 Rootstrap ... 4.8
11 micro1 ... 4.9
12 ChopDawg.com ... 4.8
13 Emergent Software ... 4.9
14 Beehive Software Inc. ... 5.0
15 3 Media Web ... 4.9
16 Webstacks ... 5.0
17 Mutually Human ... 5.0
18 AnyforSoft ... 4.8
19 NL Softworks ... 5.0
20 OpenSource Technologies Inc. ... 4.8
21 Marcel Digital ... 4.8
22 Twin Sun ... 5.0
23 SPARK Business Works ... 4.9
24 Darwin ... 4.9
25 Perrill ... 5.0
26 Nimi ... 4.9
27 Scopic ... 4.9
28 Interactive Strategies ... 4.9
29 Unleashed Technologies ... 4.9
30 Oyova ... 4.9
31 BrandExtract ... 4.9
32 The Brick Factory ... 4.9
33 My Web Programmer ... 5.0
34 PureLogics LLC ... 4.9
35 Social Driver ... 4.9
36 Calibrate Software ... 4.9
37 VisualFizz ... 5.0
38 Camber Creative ... 4.9
39 Susco Solutions ... 4.9
40 Lunarbyte.io ... 5.0
41 thoughtbot ... 4.9
42 CR Software Solutions ... 5.0
43 Solwey Consulting ... 5.0
44 Ambaum ... 4.9
45 Pacific Codeline LLC ... 5.0
46 PERC ... 5.0
47 Beesoul LLC ... 4.9
48 Novalab Tech ... 5.0
49 Dragon Army ... 5.0
[50 rows x 5 columns]
Process finished with exit code 0
and the following data is stored in the CSV:
Company_Name,Tagline,Location,Ticket_Price,Rating,Website_Name,URL
Hyperlink InfoSystem,"#1 Mobile App, Web, & Software Development Company","Jersey City, NJ","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Plego Technologies,Shaping the Future of Technology,"Downers Grove, IL","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Azuro Digital,"Award-Winning Web Design, Development & SEO","New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
App Makers USA,Top US Mobile & Web App Development Agency,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
ChopDawg.com,Dreams Delivered Since 2009. Let's Make It App'n!®,"Philadelphia, PA","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Savas Labs,Designing and developing elegant web products.,"Raleigh, NC","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Gnar Company,Solving Gnarly Software Problems. Faster.,"Boston, MA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Sunrise Integration,Enterprise Solutions & Ecommerce Apps,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Baytech Consulting,TRANSLATING YOUR VISION INTO SOFTWARE,"Irvine, CA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Inventive Works,Custom Software Product Development,"Manor, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Utility,AWARD-WINNING MOBILE DESIGN & DEVELOPMENT AGENCY,"New York, NY","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Busy Human,Making life more user-friendly,"Orem, UT","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Rootstrap,Outcome-driven development. At any scale.,"Beverly Hills, CA","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
micro1,"World-class software engineers, powered by AI","Los Angeles, CA","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Emergent Software,Your Full-Stack Technology Partner,"Saint Paul, MN","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
3 Media Web,Award-Winning Digital Experience Agency 🏆🏆🏆,"Marlborough, MA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Beehive Software Inc.,Software reinvented,"Los Gatos, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Webstacks,"The website is a product, not a project.","San Diego, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Mutually Human,Custom Software Development and Design,"Ada, MI","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
AnyforSoft,Amplify digital excellence with AnyforSoft,"Sarasota, FL","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
NL Softworks,Website Design & Development Made to Convert,"Boston, MA","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
OpenSource Technologies Inc.,Web & Mobile APP | Digital Marketing | Cloud,"Lansdale, PA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Twin Sun,Trustworthy partners that deliver results,"Nashville, TN","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Marcel Digital,Changing the Idea of What an Agency Is And Can Be,"Chicago, IL","$5,000+",4.7,Top Web Developers in the United States,https://clutch.co/us/web-developers
Darwin,We create incredible digital experiences,"Reston, VA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
SPARK Business Works,Award-winning custom software dev & web design,"Kalamazoo, MI","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Nimi,"Bring your product ideas to life, to Grow Today.","Oakland, CA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Scopic,"Your Cross-continental, Digital Innovation Partner","Rutland, MA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Interactive Strategies,"Full Service Digital Design, Dev & Marketing","Washington, DC","$100,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Unleashed Technologies,Unleash Your Potential®,"Ellicott City, MD","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Social Driver,Experience digital with us.,"Washington, DC","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Oyova,More Business For Your Business Is Our Business.™,"Jacksonville Beach, FL","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Brick Factory,A DC-based digital agency.,"Washington, DC","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
My Web Programmer,→Top-Quality Custom Software & Web Development Co.,"Atlanta, GA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
PureLogics LLC,No Magic. Just Logic.,"New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
BrandExtract,"We inspire people to create, transform, and grow.","Houston, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Calibrate Software,We craft digital experiences that spark joy 🎉,"Chicago, IL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Camber Creative,Things worth building are worth building well.,"Orlando, FL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
VisualFizz,Impactful Marketing for Industry-Leading Brands,"Chicago, IL","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Susco Solutions,Solve Together | Developing Intuitive Software,"Harvey, LA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Lunarbyte.io,Launching big ideas with startups & enterprises,"Seattle, WA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CR Software Solutions,Innovative Digital Solutions For Your Business,"Canton, MI","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Ambaum,Ambaum is your Shopify Plus Agency,"Burien, WA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Solwey Consulting,Custom software solutions to elevate your business,"Austin, TX","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Pacific Codeline LLC,"Reliable, Experienced, 100% U.S. based.","San Clemente, CA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Novalab Tech,Your Trusted IT Partner,"San Francisco, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Dragon Army,A purpose-driven digital engagement company.,"Atlanta, GA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CodigoDelSur,Rockstar coders for rockstar companies,"Montevideo, Uruguay","$75,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Brainhub,Top 1.36% engineering team - onboarding in 10 days,"Gliwice, Poland","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Curotec,Your digital product engineering department,"Philadelphia, PA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
TekRevol,Creative Web | App | Software Development Company,"Houston, TX","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
XWP,Building a better web at enterprise scale,"New York, NY","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Five Jars,⭐️⭐️⭐️⭐️⭐️ OUTSTANDING WEB DESIGN & DEVELOPMENT,"Brooklyn, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Hmm, but wait: it does not work if we choose another base URL:
https://clutch.co/il/web-developers
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
Traceback (most recent call last):
File "/home/ubuntu/.config/JetBrains/PyCharmCE2023.3/scratches/scratch.py", line 74, in <module>
df = pd.DataFrame(data)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 767, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
Process finished with exit code 1
Well, I think this has to do with some exceptions:
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
I think there may be a few issues:
First of all, there were "Element not found while extracting company details" messages. This indicates that some elements were not found while extracting details for certain companies, which could be due to variations in the website's structure or changes in the layout. I guess we can handle this by including additional error handling or by refining our XPath expressions (see the sketch below).
During several trials and attempts I also got "Timeout Exception occurred while waiting for company elements", which suggests that the script timed out while waiting for elements to load on the page.
Last but not least, I also got "ValueError: All arrays must be of the same length". This error occurs because the arrays used to construct the DataFrame end up with different lengths, which typically happens when one or more data points are not collected properly.
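For example, here is a sketch of a slightly more tolerant XPath that matches on a single class token instead of the exact class string; the Israel listing may still use different markup, so this is only a guess:

# Matches any <p> whose class attribute contains "tagline", instead of requiring
# the exact value "company_info__wrap tagline"
tagline = company_element.find_element(
    By.XPATH, './/p[contains(@class, "tagline")]').text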
Here is the code I used:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time
website = "https://clutch.co/il/it-services"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)
wait = WebDriverWait(driver, 20)
# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except NoSuchElementException:
        # No "next" link found, so we are on the last page
        return False
company_names = []
taglines = []
locations = []
costs = []
ratings = []
websites = []
current_page = 1
last_page = 250
while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break
    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)
            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)
            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)
            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)
            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
            # Extracting website URL
            website_element = company_element.find_element(By.XPATH, './/a[@class="website-link"]')
            website_url = website_element.get_attribute('href')
            websites.append(website_url)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue
    current_page += 1
    if not navigate_to_next_page():
        break
driver.close()
# Ensure all arrays have the same length
min_length = min(len(company_names), len(taglines), len(locations), len(costs), len(ratings), len(websites))
company_names = company_names[:min_length]
taglines = taglines[:min_length]
locations = locations[:min_length]
costs = costs[:min_length]
ratings = ratings[:min_length]
websites = websites[:min_length]
data = {'Company_Name': company_names, 'Tagline': taglines, 'Location': locations, 'Ticket_Price': costs, 'Rating': ratings, 'Website': websites}
df = pd.DataFrame(data)
# Check if DataFrame is empty
if not df.empty:
    df.to_csv('companies_test10.csv', index=False)
    print(df)
else:
    print("DataFrame is empty. No data to save.")
2 Answers
Unfortunately, I don’t think scraping is a practical solution here. Use the API.
Let’s address your questions one by one:
Element not found while extracting company details
This problem is easily solvable. It's just one element that could not be found on the page, so you can simply add a placeholder in its place to the lists you're using to collect the data:
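For instance, a sketch of that idea applied to one of the fields inside your existing loop (the None placeholder is arbitrary; any sentinel value works):

try:
    tagline = company_element.find_element(
        By.XPATH, './/p[@class="company_info__wrap tagline"]').text
except NoSuchElementException:
    tagline = None  # placeholder keeps all the lists the same length
taglines.append(tagline)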
Timeout Exception occurred while waiting for company elements
Here lies your main problem. clutch.co uses Cloudflare, and after you make a number of requests, it starts throttling them and redirecting them to a captcha page. One of the reasons they use it is precisely to prevent automated bots from collecting their data; you can read more about it here. So when that happens, you get a TimeoutException: since the page is taking a while, Selenium assumes the data won't load and raises this exception. You could increase the timeout, but that wouldn't be practical or last long anyway.
First, you would need to solve a captcha for each page, which is time-consuming. You could hire a service to solve that for you, but this would cost you money.
Besides, and most importantly, if you keep making automated requests through Cloudflare, they will likely add your IP to a blacklist at some point, in which case you would have to start using a proxy service. This would also cost you money.
If you really want to go that route, try using something like Cloudscraper.
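A minimal sketch of what that could look like, assuming the cloudscraper package is installed (Cloudflare may still block it depending on their current protections):

import cloudscraper

# cloudscraper wraps requests and tries to pass Cloudflare's browser checks
scraper = cloudscraper.create_scraper()
response = scraper.get("https://clutch.co/il/web-developers")
print(response.status_code)
# response.text could then be parsed with a regular HTML parser instead of Selenium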
ValueError: All arrays must be of the same length
This is a consequence of the previous problems. Pandas expects all the lists holding the data (company_names, taglines, locations, costs, and ratings) to be of the same length, since each one becomes a column of the DataFrame. When they are not of the same length, this error is raised. So lists of different lengths won't work, but equal-length lists will; see the sketch below.
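A quick illustration of the difference (toy data, not your scraped values):

import pandas as pd

# Lists of different lengths raise ValueError: All arrays must be of the same length
# pd.DataFrame({'Company_Name': ['A', 'B', 'C'], 'Rating': ['4.9', '5.0']})

# Equal-length lists (missing values filled with None) construct fine
df = pd.DataFrame({'Company_Name': ['A', 'B', 'C'], 'Rating': ['4.9', '5.0', None]})
print(df)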
If you could solve the above problems and collect all the data, this error would go away as well.
Use the API
If the API provides all the data you need, I recommend using it even if it's paid, rather than trying to scrape the data. It will be much less error-prone and require less development time. In the end, you will probably be saving money.
I would handle try/except at the level of each element, so you always end up with the same number of results (and you can later check which ones are None). More importantly, this avoids the bad practice of slicing the result lists, which can shift values between rows.
Check this snippet (untested):
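Something along these lines, reusing the selectors from the question (the safe_text helper name is mine, and I have not run this against the site):

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def safe_text(parent, by, selector):
    # Return the element's text, or None if it cannot be found
    try:
        return parent.find_element(by, selector).text
    except NoSuchElementException:
        return None

# company_elements comes from the wait.until(...) call in the question
for company_element in company_elements:
    company_names.append(safe_text(company_element, By.CLASS_NAME, "company_info"))
    taglines.append(safe_text(company_element, By.XPATH, './/p[@class="company_info__wrap tagline"]'))
    ratings.append(safe_text(company_element, By.XPATH, './/span[@class="rating sg-rating__number"]'))
    locations.append(safe_text(company_element, By.XPATH, './/span[@class="locality"]'))
    costs.append(safe_text(company_element, By.XPATH, './/div[@class="list-item block_tag custom_popover"]'))
    # The website link is an attribute rather than text, so it needs get_attribute('href')
    # and its own try/except, but the idea is the same.

With this, every list grows by exactly one entry per company, so the DataFrame columns always line up.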