I am trying to scrape this site: https://bulkfollows.com/services
What I want is every service row, with these fields: 'ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill','Avg. Time','Description','category'
I got everything except the category column. The category is a parent heading that sits above each group of services, like these:
" YouTube - Watch Time By Length" or "Instagram - Followers [ From ✓VERIFIED ACCOUNTS]"
This is my code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url="https://bulkfollows.com/services"
soup = BeautifulSoup(requests.get(url).content, "lxml")
categories = dict((e.get('data-filter-category-id'),e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu button[data-filter-category-name]'))
data= []
for e in soup.select("#serviceList tr:has(td)"):
    d = dict(zip(e.find_previous('thead').stripped_strings,e.stripped_strings))
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)
pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill','Avg. Time','Description','category']]
I need some help with the for loop to get the parent category for each row.
This is my output:
I want the category column not to be None, and I want the description to be the text that appears when you click on a service. For example, for the first service I want it to be:
Link: https://youtube.com/video Start: 0-12hrs Speed: 100-200 Per day Refill: 30 days
Please Note: Watch time will take 1-3 days to update on analytics.
After 3 days of delivery, if the watch time does not update, please
take a screenshot of your video analytic ( Not the Monetization page,
we don’t guarantee Monetization ) and upload it to prntscr.com and
send it us the uploaded screenshot ).
2 Answers
The reason your Category column only has None values is that the elements that soup.select("#serviceList tr:has(td)") finds do NOT have the HTML attribute data-filter-table-category-id. The elements it finds look like this:
From what I have deciphered from your post, you want to create a table similar to the ones on bulkfollows.com, except for 3 main differences:
Your table will be the aggregate of the tables on the website
Your table will contain an additional column, Category (which will contain the Service category IDs???)
Your table’s Description column will contain the text hidden behind the purple Details buttons.
You or someone else can figure out the precise solution to your problem; I will merely point you in the right direction.
General Approach:
First, collect all of the HTML elements that make up the individual tables. These are the div elements with the classes col-lg-12 mb-3 ser-row.
Second, iterate over the list of elements.
Then, in each iteration:
Use the same logic as in your code. That is, create a dictionary with the current table’s column names and values as the keys and values, respectively.
Get the value of the HTML attribute data-filter-table-category-id. Create a new key, Category, and assign the attribute’s value to it.
Finally, combine the dicts into a DataFrame (as you did in your code).
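The steps above can be sketched roughly as follows. This is only an illustration run against a made-up HTML fragment, not the real page: it assumes that each div.ser-row wraps one table and carries the data-filter-table-category-id attribute itself, which should be checked against the live markup. SAMPLE_HTML and collect_rows are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only; the real page's markup may differ.
SAMPLE_HTML = """
<div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="7">
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Rate per 1000</th></tr></thead>
    <tbody>
      <tr><td>101</td><td>Watch Time A</td><td>$1.50</td></tr>
      <tr><td>102</td><td>Watch Time B</td><td>$2.00</td></tr>
    </tbody>
  </table>
</div>
<div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="9">
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Rate per 1000</th></tr></thead>
    <tbody>
      <tr><td>201</td><td>Followers A</td><td>$0.80</td></tr>
    </tbody>
  </table>
</div>
"""


def collect_rows(html):
    """Return one dict per service row, tagged with its block's category id."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Step 1 and 2: collect the per-category blocks and iterate over them.
    for block in soup.select("div.ser-row"):
        # Assumption: the category id lives on the wrapping div.
        category_id = block.get("data-filter-table-category-id")
        headers = list(block.find("thead").stripped_strings)
        # Step 3: same zip logic as in the question, once per data row.
        for tr in block.select("tbody tr"):
            row = dict(zip(headers, tr.stripped_strings))
            row["Category"] = category_id
            rows.append(row)
    return rows


rows = collect_rows(SAMPLE_HTML)
```

Passing rows to pd.DataFrame would then give the combined table. If category names are wanted rather than ids, the categories dict from the question could be used to map each id to its name.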
There is no one-size-fits-all approach to scraping, so you have to select your elements more specifically; you may want to check the docs for some finding strategies.
Replace the line:
with the following, which looks back to the previous <h4> to grab the Category and ahead to the next modal to get the Description:
Example
Output
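The previous-<h4> / next-modal idea can be illustrated with a minimal sketch run against a made-up HTML fragment. Several things here are assumptions rather than facts about the real page: that each category heading is an <h4> placed before its table, that each row's details sit in an element with the Bootstrap-style class modal-body that follows the row in document order, and the names SAMPLE_HTML and scrape, which are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only; the real page's markup may differ.
SAMPLE_HTML = """
<div id="serviceList">
  <h4>YouTube - Watch Time By Length</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>101</td><td>Watch Time A</td>
        <td><button>Details</button>
          <div class="modal"><div class="modal-body">Start: 0-12hrs Speed: 100-200 Per day</div></div>
        </td>
      </tr>
    </tbody>
  </table>
  <h4>Instagram - Followers</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>201</td><td>Followers A</td>
        <td><button>Details</button>
          <div class="modal"><div class="modal-body">Refill: 30 days</div></div>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""


def scrape(html):
    soup = BeautifulSoup(html, "html.parser")
    data = []
    for tr in soup.select("#serviceList tr:has(td)"):
        # Same header/value zip as in the question.
        d = dict(zip(tr.find_previous("thead").stripped_strings, tr.stripped_strings))
        # Category: nearest <h4> that precedes this row in the document.
        heading = tr.find_previous("h4")
        d["category"] = heading.get_text(strip=True) if heading else None
        # Description: nearest modal body at or after this row (assumed class name).
        body = tr.find_next(class_="modal-body")
        d["Description"] = body.get_text(" ", strip=True) if body else None
        data.append(d)
    return data


data = scrape(SAMPLE_HTML)
```

Without the Description override, the zip would store the button label ("Details") in that column, which is why the modal text is assigned explicitly. The resulting list can be handed to pd.DataFrame as in the question.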