How to scrape related category of element using BeautifulSoup? - Html

luthierz
March 9, 2023

I am trying to scrape this site: https://bulkfollows.com/services
What I want is to get every service row with these fields: 'ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'category'. I got everything except the category column. The category is the parent heading that a group of services is listed under, for example:

" YouTube - Watch Time By Length" or "Instagram - Followers [ From  ✓VERIFIED ACCOUNTS]"

This is my code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url="https://bulkfollows.com/services"
soup = BeautifulSoup(requests.get(url).content, "lxml") 

categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu button[data-filter-category-name]'))

data= []
for e in soup.select("#serviceList tr:has(td)"):    
    d = dict(zip(e.find_previous('thead').stripped_strings,e.stripped_strings))
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)

pd.DataFrame(data)[['ID',  'Service', 'Rate per 1000', 'Min / Max', 'Refill','Avg. Time','Description','category']]

I need some help with the for loop to get the parent category for each row.
This is my output:

I want the category column to contain the category name instead of None, and the description to contain the text shown when you click Details. For example, for the first service I want it to be:

Link: https://youtube.com/video Start: 0-12hrs Speed: 100-200 Per day Refill: 30 days

Please Note: Watch time will take 1-3 days to update on analytics.
After 3 days of delivery, if the watch time does not update, please
take a screenshot of your video analytic ( Not the Monetization page,
we don’t guarantee Monetization ) and upload it to prntscr.com and
send it us the uploaded screenshot ).

Answers

The reason your category column only has None values is that the elements found by soup.select("#serviceList tr:has(td)") do NOT have the HTML attribute data-filter-table-category-id. The elements it finds look like this:

<tr class="">
 <td class="service-id">
  7365
 </td>
 <td class="service-name">
  YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ]
 </td>
 <td class="service-rate">
  $4.80
 </td>
 <td class="service-min-max">
  100 / 120000
 </td>
 <td class="">
  <span class="badge gurantee">
   Refill 30 days
  </span>
 </td>
 <td class="average-time ser-id-7365">
  63 hours 40 minutes
 </td>
 <td class="text-right service-description">
  <a class="btn btn-sm btn-info" data-target="#description-7365" data-toggle="modal" href="javascript:void(0);">
   <i class="mdi mdi-information">
   </i>
   Details
  </a>
  <!-- Modal -->
  <div aria-hidden="true" aria-labelledby="description7365Label" class="modal fade text-left" id="description-7365" role="dialog" tabindex="-1">
   <div class="modal-dialog" role="document">
    <div class="modal-content">
     <div class="modal-header">
      <h5 class="modal-title" id="description7365Label">
       YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ]'s Description
      </h5>
      <button aria-label="Close" class="close" data-dismiss="modal" type="button">
       <span aria-hidden="true">
        ×
       </span>
      </button>
     </div>
     <div class="modal-body">
      <p style="line-height: 20px;">
       Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg
       <br/>
       Start: Instant - 0 hrs
       <br/>
       Speed: 500-2k/day
       <br/>
       Refill: 30 days
       <br/>
       <br/>
       Drop: 0- 5% drop.
      </p>
     </div>
     <div class="modal-footer">
      <button class="btn btn-primary" data-dismiss="modal" type="button">
       <i class="mdi mdi-close">
       </i>
       Close
      </button>
     </div>
    </div>
   </div>
  </div>
 </td>
</tr>
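
A quick way to confirm this is to look the attribute up on one row. The sketch below runs on a heavily trimmed, hypothetical stand-in for the row above (not the live page) and uses the built-in html.parser so it needs no network access:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for one service table (not the live markup).
html = """
<table id="serviceList">
  <thead><tr><th>ID</th><th>Service</th></tr></thead>
  <tbody>
    <tr class="">
      <td class="service-id">7365</td>
      <td class="service-name">YouTube - Subscribers</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same selector as in the question: the first <tr> that contains a <td>.
row = soup.select_one("#serviceList tr:has(td)")

# Tag.get() returns None when the attribute is absent, which is why the
# conditional in the question always falls through to None.
category_id = row.get("data-filter-table-category-id")
print(category_id)
```

Since the row itself carries no category information, it has to come from an ancestor or from an element that precedes the row.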

From what I can gather from your post, you want to create a table similar to the ones on bulkfollows.com, with 3 main differences:

Your table will be the aggregate of the tables on the website.
Your table will contain an additional column, Category (which will contain the Service category IDs???).
Your table's Description column will contain the text hidden behind the purple Details buttons.

You or someone else can work out the precise solution to your problem; I will only point you in the right direction.

General Approach:

First, collect all of the HTML elements that make up the individual tables. These are the div elements with the classes col-lg-12 mb-3 ser-row.

tables = soup.select('div.col-lg-12.mb-3.ser-row')

Second, iterate over the list of elements.

Then, in each iteration:

Use the same logic as in your code. That is, create a dictionary with the current table's column names as the keys and the row's values as the values.
Get the value of the HTML attribute data-filter-table-category-id. Create a new key, Category, and assign the attribute's value to it.
Finally, after the loop, combine the dicts into a DataFrame (as you did in your code).
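
The steps above could be sketched roughly as follows. The markup is a hypothetical, simplified stand-in for the page, and it assumes that data-filter-table-category-id sits on each container div; check the real page to see which element actually carries it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the attribute is ASSUMED to sit on each container div.
html = """
<div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="1">
  <table>
    <thead><tr><th>ID</th><th>Service</th></tr></thead>
    <tbody>
      <tr><td>7365</td><td>YouTube - Subscribers</td></tr>
      <tr><td>7363</td><td>Spotify - Plays</td></tr>
    </tbody>
  </table>
</div>
<div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="2">
  <table>
    <thead><tr><th>ID</th><th>Service</th></tr></thead>
    <tbody>
      <tr><td>7613</td><td>Australia Traffic from Instagram</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

data = []
# Step 1 and 2: collect the container divs and iterate over them.
for table in soup.select("div.col-lg-12.mb-3.ser-row"):
    headers = [th.get_text(strip=True) for th in table.select("thead th")]
    # Read the category id once per table; every row below inherits it.
    category_id = table.get("data-filter-table-category-id")
    for row in table.select("tr:has(td)"):
        cells = [td.get_text(strip=True)
                 for td in row.find_all("td", recursive=False)]
        d = dict(zip(headers, cells))
        d["Category"] = category_id
        data.append(d)

print(data)
```

Passing data to pd.DataFrame(data) then gives one aggregated table. To show category names instead of IDs, the IDs could be mapped through the categories dict built in the question.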

There is no one-size-fits-all approach to scraping, so you have to select your elements more specifically; check the docs for other search strategies.

Replace the line:

d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None

with the following, which looks back to the previous <h4> to grab the Category and into the modal inside the row to get the Description:

d['Category'] = e.find_previous('h4').get_text(strip=True)
d['Description'] = e.find('div',{'class':'modal-body'}).get_text(' ',strip=True)
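
To see what these two lines return, here is a small sketch on a hypothetical, trimmed fragment (the heading and description texts are borrowed from the question; the structure is simplified). The guard against a missing modal is an addition for illustration, not part of the original lines.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an <h4> heading followed by one service table.
html = """
<div>
  <h4>YouTube - Watch Time By Length</h4>
  <table id="serviceList">
    <thead><tr><th>ID</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>7365</td>
        <td>
          <a>Details</a>
          <div class="modal-body">
            <p>Link: https://youtube.com/video<br/>Start: 0-12hrs<br/>Refill: 30 days</p>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one("#serviceList tr:has(td)")

# find_previous() walks backwards through the document from the row and
# returns the first <h4> it meets, i.e. the heading above this table.
category = row.find_previous("h4").get_text(strip=True)

# find() searches inside the row; get_text(' ', strip=True) joins the text
# fragments separated by <br/> with single spaces.
body = row.find("div", {"class": "modal-body"})
description = body.get_text(" ", strip=True) if body else None  # added guard

print(category)
print(description)
```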

Example

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://bulkfollows.com/services"
soup = BeautifulSoup(requests.get(url).content, "lxml")

data = []
# Every table row that contains <td> cells is one service.
for e in soup.select("#serviceList tr:has(td)"):
    # Pair the column headers with the row's text fragments.
    d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
    # The category name is the closest <h4> heading before the row.
    d['Category'] = e.find_previous('h4').get_text(strip=True)
    # The full description is the text of the modal inside the row.
    d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)
    data.append(d)

pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'Category']]

Output

	ID	Service	Rate per 1000	Min / Max	Refill	Avg. Time	Description	Category
0	7365	YouTube – Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 – 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ]	$4.80	100 / 120000	Refill 30 days	59 hours 53 minutes	Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg Start: Instant – 0 hrs Speed: 500-2k/day Refill: 30 days Drop: 0- 5% drop.	❖ Bulkfollows High Demand Services
1	7363	Spotify – 𝐅𝐑𝐄𝐄 Plays ~ 𝐋𝐢𝐟𝐞𝐓𝐢𝐦𝐞 ~ 10k-50k/days ~ USA/Russian ~ [ 𝔅𝗲𝙨𝘁 – 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ]	$0.188	1000 / 100000000	Refill Lifetime	22 hours 26 minutes	Link: https://open.spotify.com/track/40Zb4FZ6nS1Hj8RVfaLkCV Start: Instant ( Avg 0-3 hrs ) Speed: 10k to 20k days Refill: Lifetime Quality: Plays from Bot Created free accounts. Make sure you know the risk of adding of bot plays Drop: Spotify Plays are stable, do not drop. Delivery Time: It will take 2-5 days to update plays. If it’s delivery 10k in 1 day, then this 10k will take 2-5 days to update, the next 10k plays will take the next 2-5 days, and so on.	❖ Bulkfollows High Demand Services
3973	7613	Australia Traffic from Instagram	$0.025	100 / 1000000	No Refill	Not enough data	💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL	⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ]
3974	7614	Australia Traffic from Wikipedia	$0.025	100 / 1000000	No Refill	Not enough data	💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL	⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ]
