Ubuntu - running bs4 scraper needs to be redefined to enrich the dataset - some issues

thannen
February 3, 2024
197 views
0 votes
2 Answers

I have a bs4 scraper that works together with Selenium (the working version is further below). It works fine so far.

It fetches data from this page: clutch.co/il/it-services

To enrich the scraped data with additional information, I tried to modify the scraping logic to extract more details for each company. Here is the updated version of the code, which also extracts the company's website and a description:

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_info = soup.select(".directory-list div.provider-info")

data_list = []
for info in company_info:
    company_name = info.select_one(".company_info a").get_text(strip=True)
    location = info.select_one(".locality").get_text(strip=True)
    website = info.select_one(".company_info a")["href"]
    
    # Additional information you want to extract goes here
    # For example, you can extract the description
    description = info.select_one(".description").get_text(strip=True)
    
    data_list.append({
        "Company Name": company_name,
        "Location": location,
        "Website": website,
        "Description": description
    })

df = pd.DataFrame(data_list)
df.index += 1

print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data_enriched.csv", index=False)

driver.quit()

The idea behind this extended version: I added a loop that goes through each company's block, extracts the website, and adds a placeholder for additional information (in this case, the description). I thought I could adapt this loop to extract more data as needed.

I assume the HTML structure of the page has changed, so I probably need to adjust the CSS selectors to the current structure and customize the scraping logic for the specific details I want to extract. I think I am very close, but this is what I got back:

/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/bin/python /home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py
/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
       
 import pandas as pd
Traceback (most recent call last):
 File "/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py", line 29, in <module>
   description = info.select_one(".description").get_text(strip=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_text'

Process finished with exit code

And now, for comparison, my already working version, which fetches some data from the same page: clutch.co/il/it-services

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)

driver.quit()

import pandas as pd

+----+-----------------------------------------------------+--------------------------------+
|    | Company Name                                        | Location                       |
|----+-----------------------------------------------------+--------------------------------|
|  1 | Artelogic                                           | L'viv, Ukraine                 |
|  2 | Iron Forge Development                              | Palm Beach Gardens, FL         |
|  3 | Lionwood.software                                   | L'viv, Ukraine                 |
|  4 | Greelow                                             | Tel Aviv-Yafo, Israel          |
|  5 | Ester Digital                                       | Tel Aviv-Yafo, Israel          |
|  6 | Nextly                                              | Vitória, Brazil                |
|  7 | Rootstack                                           | Austin, TX                     |
|  8 | Novo                                                | Dallas, TX                     |
|  9 | Scalo                                               | Tel Aviv-Yafo, Israel          |
| 10 | TLVTech                                             | Herzliya, Israel               |
| 11 | Dofinity                                            | Bnei Brak, Israel              |
| 12 | PURPLE                                              | Petah Tikva, Israel            |
| 13 | Insitu S2 Tikshuv LTD                               | Haifa, Israel                  |
| 14 | Opinov8 Technology Services                         | London, United Kingdom         |
| 15 | Sogo Services                                       | Tel Aviv-Yafo, Israel          |
| 16 | Naviteq LTD                                         | Tel Aviv-Yafo, Israel          |
| 17 | BMT - Business Marketing Tools                      | Ra'anana, Israel               |
| 18 | Profisea                                            | Hod Hasharon, Israel           |
| 19 | MeteorOps                                           | Tel Aviv-Yafo, Israel          |
| 20 | Trivium Solutions                                   | Herzliya, Israel               |
| 21 | Dynomind.tech                                       | Jerusalem, Israel              |
| 22 | Madeira Data Solutions                              | Kefar Sava, Israel             |
| 23 | Titanium Blockchain                                 | Tel Aviv-Yafo, Israel          |
| 24 | Octopus Computer Solutions                          | Tel Aviv-Yafo, Israel          |
| 25 | Reblaze                                             | Tel Aviv-Yafo, Israel          |
| 26 | ELPC Networks Ltd                                   | Rosh Haayin, Israel            |
| 27 | Taldor                                              | Holon, Israel                  |
| 28 | Clarity                                             | Petah Tikva, Israel            |
| 29 | Opsfleet                                            | Kfar Bin Nun, Israel           |
| 30 | Hozek Technologies Ltd.                             | Petah Tikva, Israel            |
| 31 | ERG Solutions                                       | Ramat Gan, Israel              |
| 32 | Komodo Consulting                                   | Ra'anana, Israel               |
| 33 | SCADAfence                                          | Ramat Gan, Israel              |
| 34 | Ness Technologies | נס טכנולוגיות                         | Tel Aviv-Yafo, Israel          |
| 35 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel          |
| 36 | Radware                                             | Tel Aviv-Yafo, Israel          |
| 37 | BigData Boutique                                    | Rishon LeTsiyon, Israel        |
| 38 | NetNUt                                              | Tel Aviv-Yafo, Israel          |
| 39 | Asperii                                             | Petah Tikva, Israel            |
| 40 | PractiProject                                       | Ramat Gan, Israel              |
| 41 | K8Support                                           | Bnei Brak, Israel              |
| 42 | Odix                                                | Rosh Haayin, Israel            |
| 43 | Panaya                                              | Hod Hasharon, Israel           |
| 44 | MazeBolt Technologies                               | Giv'atayim, Israel             |
| 45 | Porat                                               | Tel Aviv-Jaffa, Israel         |
| 46 | MindU                                               | Tel Aviv-Yafo, Israel          |
| 47 | Valinor Ltd.                                        | Petah Tikva, Israel            |
| 48 | entrypoint                                          | Modi'in-Maccabim-Re'ut, Israel |
| 49 | Adelante                                            | Tel Aviv-Yafo, Israel          |
| 50 | Code n' Roll                                        | Haifa, Israel                  |
| 51 | Linnovate                                           | Bnei Brak, Israel              |
| 52 | Viceman Agency                                      | Tel Aviv-Jaffa, Israel         |
| 53 | develeap                                            | Tel Aviv-Yafo, Israel          |
| 54 | Chalir.com                                          | Binyamina-Giv'at Ada, Israel   |
| 55 | WolfCode                                            | Rishon LeTsiyon, Israel        |
| 56 | Penguin Strategies                                  | Ra'anana, Israel               |
| 57 | ANG Solutions                                       | Tel Aviv-Yafo, Israel          |
+----+-----------------------------------------------------+--------------------------------+

What I am aiming for: I want to fetch some more data from the given page, clutch.co/il/it-services, e.g. the website and so on.

Update: The error AttributeError: 'NoneType' object has no attribute 'get_text' indicates that select_one(".description") did not find any element with the class description inside the current company block, so it returned None. Calling .get_text(strip=True) on None then raises the AttributeError.

More to follow later today.

Update 2:
Note: @jakob had an interesting idea, posted here: "Selenium in Google Colab without having to worry about managing the ChromeDriver executable". I tried an example using kora.selenium.
His note: "I made Google-Colab-Selenium to solve this problem. It manages the executable and the required Selenium Options for you." That sounds very interesting. At the moment I cannot imagine getting Selenium to work on Colab in such a way that the scraper above runs there fully. Any ideas would be awesome:

Jakob: the real issue is that the website you are trying to scrape is using CloudFlare, which can detect selenium.
I wrote a little code to scrape the data that you were looking for.
You actually don’t need to use Selenium as the data is already baked right into the HTML when you go to the webpage.

https://colab.research.google.com/drive/1qkZ1OV_Nqeg13UY3S9pY0IXuB4-q3Mvx?usp=sharing

%pip install -q curl_cffi
%pip install -q fake-useragent
%pip install -q lxml

from curl_cffi import requests
from fake_useragent import UserAgent
# we need to take care for this: https://pypi.org/project/fake-useragent/

ua = UserAgent()    
headers = {'User-Agent': ua.safari}
resp = requests.get('https://clutch.co/il/it-services', headers=headers, impersonate="safari15_3")
resp.status_code


# I like to use this to verify the contents of the request
from IPython.display import HTML

HTML(resp.text)

from lxml.html import fromstring

tree = fromstring(resp.text)

data = []

for company in tree.xpath('//ul/li[starts-with(@id, "provider")]'):
    data.append({
        "name": company.xpath('./@data-title')[0].strip(),
        "location": company.xpath('.//span[@class = "locality"]')[0].text,
        "wage": company.xpath('.//div[@data-content = "<i>Avg. hourly rate</i>"]/span/text()')[0].strip(),
        "min_project_size": company.xpath('.//div[@data-content = "<i>Min. project size</i>"]/span/text()')[0].strip(),
        "employees": company.xpath('.//div[@data-content = "<i>Employees</i>"]/span/text()')[0].strip(),
        "description": company.xpath('.//blockquote//p')[0].text,
        "website_link": (company.xpath('.//a[contains(@class, "website-link__item")]/@href') or ['Not Available'])[0],
    })


import pandas as pd
from pandas import json_normalize
df = json_normalize(data, max_level=0)
df
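
One caveat with the `[0]` lookups above: `xpath()` returns a list for node and attribute queries, so indexing an empty result raises an IndexError whenever a listing lacks that field. Below is a minimal sketch of a guard, assuming a hypothetical helper named `first_text` (it is not part of lxml or of the notebook). It is shown on plain lists with made-up values so that it runs without network access:

```python
def first_text(results, default="Not Available"):
    """Return the first non-empty string from an XPath-style result list.

    `results` is expected to be a list of strings, as returned by queries
    ending in text() or @attribute. The helper name and the default value
    are illustrative assumptions.
    """
    for value in results:
        cleaned = value.strip()
        if cleaned:
            return cleaned
    return default


# Simulated XPath results (invented values): populated, empty, whitespace-only.
print(first_text(["  $50 - $99 / hr "]))
print(first_text([]))
print(first_text(["   "], default="n/a"))
```

In the loop above, a call such as `first_text(company.xpath('.//div[...]/span/text()'))` could then replace the `[0].strip()` pattern, in the same spirit as the `or ['Not Available']` fallback already used for the website link.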

That said, I think I understand the approach: fetch the HTML and then work on it with XPath. The part I have difficulties with is the User-Agent.
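
On the User-Agent part: it is only an HTTP request header in which the client states which browser it claims to be. In the notebook, `ua.safari` supplies a Safari-style string for that header, and `impersonate="safari15_3"` asks curl_cffi to mimic that browser at the connection level as well. A minimal sketch of the header idea alone, using only the standard library and an example User-Agent string that is an assumption for illustration (no request is actually sent):

```python
from urllib.request import Request

# Example Safari-style User-Agent string (an illustrative value, not one
# taken from the notebook or guaranteed to match fake-useragent's output).
SAFARI_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/15.3 Safari/605.1.15"
)

# Build the request object only; nothing is sent over the network here.
request = Request(
    "https://clutch.co/il/it-services",
    headers={"User-Agent": SAFARI_UA},
)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Without such a header, a script announces itself with its library's default identifier, which is one of the simpler things a site can filter on.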

It works awesome; it is just a bit overwhelming!

Answers

- TimWolfe
- February 2, 2024 at 3:35 am
- 0 votes
The root cause of the error AttributeError: 'NoneType' object has no attribute 'get_text' in your BeautifulSoup and Selenium scraper is that select_one(".description") looks for an element with the class description, and no such element exists in some of the company sections of the page. When no element is found, select_one returns None, and calling .get_text() on None raises the AttributeError. To fix this, add a conditional check that the element exists before you access its text content.
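
A minimal sketch of such a conditional check, assuming a hypothetical helper named `text_or_default` and a small invented HTML sample in place of the real page:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for two listings; the markup is invented for illustration.
SAMPLE_HTML = """
<div class="provider-info"><p class="description">Cloud consulting</p></div>
<div class="provider-info"><p class="tagline">No description here</p></div>
"""


def text_or_default(parent, selector, default="N/A"):
    """Return the stripped text of the first match, or `default` if none."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
descriptions = [
    text_or_default(info, ".description")
    for info in soup.select("div.provider-info")
]
print(descriptions)  # ['Cloud consulting', 'N/A']
```

In the scraper's loop the same helper could wrap each `select_one(...)` lookup, so a missing field yields the default instead of stopping the run.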


- eternal_white
- February 3, 2024 at 9:15 am
- 0 votes
TL;DR

Change this line:
```
description = info.select_one(".description").get_text(strip=True)
```
to this:
```
description = [i for i in info.find_all("div") if 'description' in ''.join(i['class'])][0].get_text(strip=True)
```
This finds the div tags that have description anywhere in their class attribute, whether it is a class name on its own or part of a longer class name.

Explanation

I'm not an expert in BeautifulSoup, and I'd encourage not using it if you're already dealing with Selenium (selecting elements with Selenium is much easier once you learn XPath). Anyway, only one modification is needed for your code to work:

It’s this line:
```
description = info.select_one(".description").get_text(strip=True)
```
should be like this:
```
description = [i for i in info.find_all("div") if 'description' in ''.join(i['class'])][0].get_text(strip=True)
```
Your original code had this:
```
info.select_one(".description")
```
This looks inside the info element for an element with the class description. What is actually there is a div with this class attribute: col-md-3 provider-info__description.

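So the element exists, but bs4 did not find it. That is because the selector .description only matches an element that has a class name exactly equal to description, and bs4 treats the class attribute as a list of separate class names.

So the class attribute we saw earlier looks like this:
['col-md-3', 'provider-info__description']

Neither entry equals description, so select_one returns None.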

If you want to test it yourself, try this code:
```
for i in company_info[0].find_all("div"):
    # .get returns None instead of raising KeyError for a div without a class
    print(i.get('class'))
```
This prints the class list of every div tag it finds. You'll see ['col-md-3', 'provider-info__description'] at the bottom.
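
The same behaviour can be reproduced offline. The sketch below uses invented markup that only mirrors the class names mentioned above, so it is an assumption about the page rather than a copy of it:

```python
from bs4 import BeautifulSoup

# Invented markup that mirrors the class names mentioned above.
SAMPLE_HTML = (
    '<div class="provider-info">'
    '<div class="col-md-3 provider-info__description">Managed IT</div>'
    "</div>"
)

info = BeautifulSoup(SAMPLE_HTML, "html.parser").select_one("div.provider-info")
inner = info.find("div")

# bs4 exposes the class attribute as a list of class names.
print(inner["class"])  # ['col-md-3', 'provider-info__description']

# ".description" needs a class name that is exactly "description": no match.
print(info.select_one(".description"))  # None

# Selecting by the full class name matches.
print(info.select_one(".provider-info__description").get_text(strip=True))
```

If the page really uses that class name, selecting `.provider-info__description` directly would be the smallest change to the original line.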

I don't know why you're using .select and .select_one; I usually use .find and .find_all (you pass a tag name, and you can specify classes and other attributes instead of a CSS selector).

So you could either replace every .select with .find_all, or replace it only in this one place (my solution).

OK, back to the solution. So, let’s see the new line of code again:
```
description = [i for i in info.find_all("div") if 'description' in ''.join(i['class'])][0].get_text(strip=True)
```
This line will look for all the div tags that are inside your info element. Then, it’ll only select the ones that have description inside their class.
```
''.join(i['class'])][0].get_text(strip=True)
```
NOTE: if you're confused by the syntax of this part, it is called a Python list comprehension.

The join part combines all the class names into one string, so description does not have to be a separate class; we only want to know whether description appears somewhere in it.

This solution should almost always work.
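
Two cases where the one-liner can still fail: a div without a class attribute makes `i['class']` raise a KeyError, and a company block with no matching div makes `[0]` raise an IndexError. A minimal sketch of a more defensive variant, assuming a hypothetical helper named `find_description` and invented sample markup:

```python
from bs4 import BeautifulSoup

# Invented markup: one block with a description div, one without.
SAMPLE_HTML = """
<div class="provider-info">
  <div>wrapper without a class attribute</div>
  <div class="col-md-3 provider-info__description">Managed IT</div>
</div>
<div class="provider-info"><div class="row">nothing relevant</div></div>
"""


def find_description(info, default="N/A"):
    """Text of the first div whose class list mentions 'description'."""
    match = next(
        (
            div
            for div in info.find_all("div")
            # .get avoids KeyError on divs that have no class attribute
            if "description" in " ".join(div.get("class", []))
        ),
        None,  # avoids IndexError when no div matches
    )
    return match.get_text(strip=True) if match is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = [find_description(info) for info in soup.select("div.provider-info")]
print(results)  # ['Managed IT', 'N/A']
```

The matching rule is the same as in the one-liner; only the two failure cases are turned into a default value.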

Hope it helps!