How to scrape the categories belonging to the datasets with BeautifulSoup? - Html

luthierz
March 4, 2023
194 views
0 votes
2 Answers

I scraped a site at this URL: https://takipcimerkezi.net/services

I tried to get all the information in the table except the "aciklama" (description) column.

This is my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

url='https://takipcimerkezi.net/services'
page= requests.get(url)
table=BeautifulSoup(page.content, 'html.parser')

max_sipariş= table.find_all(attrs={"data-label":"Maksimum Sipariş"})
maxsiparis=[]
for i in max_sipariş:
    value=i.text
    maxsiparis.append(value)

min_sipariş= table.find_all(attrs={"data-label":"Minimum Sipariş"})
minsiparis=[]
for i in min_sipariş:
    value=i.text
    # append inside the loop so every value is kept, not only the last one
    minsiparis.append(value)

bin_adet_fiyati= table.find_all(attrs={"data-label":"1000 adet fiyatı "})
binadetfiyat=[]
for i in bin_adet_fiyati:
    value=i.text.strip()
    binadetfiyat.append(value)

# named ids rather than id to avoid shadowing the built-in id()
ids= table.find_all(attrs={"data-label":"ID"})
idlist=[]
for i in ids:
    value=i.text
    idlist.append(value)

servis= table.find_all(attrs={"data-label":"Servis"})
servislist=[]
for i in servis:
    value=i.text
    servislist.append(value)

Then I took the values and put them into an Excel sheet.
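A minimal sketch of that step, assuming the per-column lists all have the same length. The inline HTML, the `LABELS` list, and the `columns_by_label` helper are hypothetical stand-ins for illustration, not the real page or the code actually used:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the services table (not the real page).
SAMPLE_HTML = """
<table>
  <tr>
    <td data-label="ID">1</td>
    <td data-label="Servis">Service one</td>
    <td data-label="Minimum Sipariş"> 100 </td>
  </tr>
  <tr>
    <td data-label="ID">2</td>
    <td data-label="Servis">Service two</td>
    <td data-label="Minimum Sipariş"> 50 </td>
  </tr>
</table>
"""

# Column names to collect; each matches a data-label attribute value.
LABELS = ["ID", "Servis", "Minimum Sipariş"]


def columns_by_label(html, labels):
    """Return {label: [cell text, ...]} for every requested data-label."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        label: [
            cell.get_text(strip=True)
            for cell in soup.find_all(attrs={"data-label": label})
        ]
        for label in labels
    }


# One DataFrame column per label; this requires equal-length lists.
df = pd.DataFrame(columns_by_label(SAMPLE_HTML, LABELS))
print(df)

# Writing to Excel needs an Excel writer such as openpyxl to be installed:
# df.to_excel("services.xlsx", index=False)
```

Collecting each column separately only lines up if every row has every cell, which is one reason the answers below iterate row by row instead.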

The last thing I need is a new column showing which category each row belongs to.

For example, the row with ID "158" is in the "Önerilen Servisler" category, and likewise IDs "4", "1526", "1", "1494", and so on. The rows that follow, up to ID "1537", need to be in the "Instagram %100 Gerçek Premium Servisler" category.

I hope I explained the problem well. How can I do this?

Answers

To add a parent category column to the DataFrame, you can use the following example:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://takipcimerkezi.net/services"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for tr in soup.select("tr:not(:has(td[colspan], th))"):
    prev = tr.find_previous("td", attrs={"colspan": True})
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    all_data.append([prev.get_text(strip=True), *tds[:5]])

df = pd.DataFrame(
    all_data,
    columns=["Parent", "ID", "Servis", "1000 adet fiyatı", "Minimum Sipariş", "Maksimum Sipariş"],
)
print(df.head())
df.to_csv("data.csv", index=False)

Prints:

               Parent    ID                                                                                                              Servis 1000 adet fiyatı Minimum Sipariş Maksimum Sipariş
0  Önerilen Servisler   158      3613-🙂 Instagram Garantili Takipçi | Max 3M | Ömür Boyu Garantili | Düşüş Çok Az | Anlık Başlar | Günde 150K 🔥         13.17 TL             100          3000000
1  Önerilen Servisler     4  1495-🙂 Instagram Garantili Takipçi | Max 1M | 365 Gün Telafi Garantili | Hızlı Başlar | 30 Gün Telafi Butonu Aktif         12.07 TL              50          5000000
2  Önerilen Servisler  1526            4513-🙂 Instagram Takipçi | Max 500K | Yabancı Gerçek Kullanıcılar | Düşme Az | Anlık Başlar | Günde 250K         22.28 TL           10000           500000
3  Önerilen Servisler     1            3033-🙂 Instagram Türk Takipçi | Max 25K | %90 Türk 🇹🇷 | İptal Butonu Aktif | Anlık Başlar | Saatte 1K-2K         21.49 TL              10            25000
4  Önerilen Servisler  1494         991-🙂 Instagram Çekilişle Takipçi | %100 Organik Türk 🇹🇷 | Max 10K | Günlük İşleme Alınır | Günde 5K Atar !         37.50 TL            1000            10000

and saves the data to data.csv.

EDIT: A brief explanation of the code above:

First, I select all data rows, that is, rows that contain neither a table header nor a cell with the colspan= attribute (the colspan rows hold the category names, which become the "Parent" column). This is done with the CSS selector "tr:not(:has(td[colspan], th))".
While iterating over these data rows, I need to know each row's "Parent". For this I use tr.find_previous("td", attrs={"colspan": True}), which selects the nearest preceding <td> that has the colspan= attribute.
I then get the text from all the <td> tags in the row and store it in the all_data list.
Finally, I create a pandas DataFrame from this list.
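The steps above can be tried offline with a small sketch. The inline HTML and the `rows_with_parent` helper are hypothetical; they only assume, as the real table appears to, that category rows use a `<td colspan=...>` cell:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table: category rows are marked by a colspan cell.
SAMPLE_HTML = """
<table>
  <thead><tr><th>ID</th><th>Servis</th></tr></thead>
  <tbody>
    <tr><td colspan="2">Category A</td></tr>
    <tr><td>1</td><td>Service one</td></tr>
    <tr><td>2</td><td>Service two</td></tr>
    <tr><td colspan="2">Category B</td></tr>
    <tr><td>3</td><td>Service three</td></tr>
  </tbody>
</table>
"""


def rows_with_parent(html):
    """Return [parent, cell, cell, ...] for every data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Skip header rows (th) and category rows (td with colspan).
    for tr in soup.select("tr:not(:has(td[colspan], th))"):
        # Walk backwards to the nearest category cell before this row.
        parent = tr.find_previous("td", attrs={"colspan": True})
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        # Guard against data rows that appear before any category row.
        parent_text = parent.get_text(strip=True) if parent else None
        rows.append([parent_text, *cells])
    return rows


rows = rows_with_parent(SAMPLE_HTML)
for row in rows:
    print(row)
```

The `if parent else None` guard is an addition here; the answer's code assumes every data row is preceded by a category row.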

Simply adapt the approach from the previous post: scrape the categories first, then map them to each row while scraping the data:

categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu a[data-filter-category-name]'))

Example

from bs4 import BeautifulSoup
import pandas as pd
import requests

url='https://takipcimerkezi.net/services'

soup = BeautifulSoup(
    requests.get(
        url,
        cookies={'user_currency':'27d210f1c3ff7fe5d18b5b41f9b8bb351dd29922d175e2a144af68924e3064d1a%3A2%3A%7Bi%3A0%3Bs%3A13%3A%22user_currency%22%3Bi%3A1%3Bs%3A3%3A%22EUR%22%3B%7D;'}
    ).text,
    'html.parser'  # name the parser explicitly to avoid bs4's "no parser specified" warning
)

categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu a[data-filter-category-name]'))

data =  []

for e in soup.select('#service-tbody tr:has([data-label="Minimum Sipariş"])'):
    d = dict(zip(e.find_previous('thead').stripped_strings,e.stripped_strings))
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)
 
pd.DataFrame(data)[['ID',  'category', 'Servis', '1000 adet fiyatı', 'Minimum Sipariş','Maksimum Sipariş']]

Output

	ID	category	Servis	1000 adet fiyatı	Minimum Sipariş	Maksimum Sipariş
0	158	Önerilen Servisler	3613-🙂 Instagram Garantili Takipçi | Max 3M | Ömür Boyu Garantili | Düşüş Çok Az | Anlık Başlar| Günde 150K 🔥	≈ 0.6573 €	100	3000000
1	4	Önerilen Servisler	1495-🙂 Instagram Garantili Takipçi | Max 1M | 365 Gün Telafi Garantili | Hızlı Başlar | 30 Gün Telafi Butonu Aktif	≈ 0.6024 €	50	5000000
…
1326	1039	Spotify Türk Dinlenme 🇹🇷	1833-⬆️ Spotify Premium Türk Dinlenme | 5K Tek Paket | Normal	≈ 4.9778 €	5000	5000
1327	1040	Spotify Türk Dinlenme 🇹🇷	1834-⬆️ Spotify Premium Türk Dinlenme | 10K Tek Paket | Normal	≈ 4.9778 €	10000	10000
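The mapping idea can be checked without a network request. This is only a sketch on hypothetical inline HTML that mimics the attributes used above (`data-filter-category-id`, `data-filter-category-name`, `data-filter-table-category-id`); the `rows_with_category` helper is also made up for illustration, and the real page may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: a category dropdown plus a services table.
SAMPLE_HTML = """
<div class="dropdown-menu">
  <a data-filter-category-id="10" data-filter-category-name="Category A">A</a>
  <a data-filter-category-id="20" data-filter-category-name="Category B">B</a>
</div>
<table>
  <thead><tr><th>ID</th><th>Servis</th><th>Minimum Sipariş</th></tr></thead>
  <tbody id="service-tbody">
    <tr data-filter-table-category-id="10"><td colspan="3">Category A</td></tr>
    <tr data-filter-table-category-id="10">
      <td data-label="ID">1</td>
      <td data-label="Servis">Service one</td>
      <td data-label="Minimum Sipariş">100</td>
    </tr>
    <tr data-filter-table-category-id="20">
      <td data-label="ID">3</td>
      <td data-label="Servis">Service three</td>
      <td data-label="Minimum Sipariş">50</td>
    </tr>
    <tr>
      <td data-label="ID">9</td>
      <td data-label="Servis">Uncategorised</td>
      <td data-label="Minimum Sipariş">10</td>
    </tr>
  </tbody>
</table>
"""


def rows_with_category(html):
    """Return one dict per data row, with a 'category' key looked up by id."""
    soup = BeautifulSoup(html, "html.parser")
    # Build the id -> name lookup from the dropdown links first.
    categories = {
        a.get("data-filter-category-id"): a.get("data-filter-category-name")
        for a in soup.select(".dropdown-menu a[data-filter-category-name]")
    }
    records = []
    # Only rows with a "Minimum Sipariş" cell are data rows.
    for tr in soup.select('#service-tbody tr:has([data-label="Minimum Sipariş"])'):
        headers = tr.find_previous("thead").stripped_strings
        record = dict(zip(headers, tr.stripped_strings))
        # .get() yields None for a missing or unknown category id.
        record["category"] = categories.get(tr.get("data-filter-table-category-id"))
        records.append(record)
    return records


records = rows_with_category(SAMPLE_HTML)
for record in records:
    print(record)
```

Using `categories.get(...)` instead of indexing is a small deviation from the code above: it also returns `None` when a row carries an id that is absent from the dropdown, rather than raising `KeyError`.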
