The project: for a list of meta-data of wordpress-plugins: – approx 50 plugins are of interest! but the challenge is: i want to fetch meta-data of all the existing plugins. What i subsequently want to filter out after the fetch is – those plugins that have the newest timestamp – that are updated (most) recently. It is all aobut acutality… so the base-url to start is this:
url = "https://wordpress.org/plugins/browse/popular/
aim: i want to fetch all the metadata of the plugins that we find on the first 50 pages of the popular-plugins…. for example…:
https://wordpress.org/plugins/wp-job-manager
Ninja Forms Contact Form – The Drag and Drop Form Builder for WordPress
https://wordpress.org/plugins/participants-database ....and so on and so forth.
here we go:
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor
url = "https://wordpress.org/plugins/browse/popular/{}"
def main(url, num):
with requests.Session() as req:
print(f"Collecting Page# {num}")
r = req.get(url.format(num))
soup = BeautifulSoup(r.content, 'html.parser')
link = [item.get("href")
for item in soup.findAll("a", rel="bookmark")]
return set(link)
with ThreadPoolExecutor(max_workers=20) as executor:
futures = [executor.submit(main, url, num)
for num in [""]+[f"page/{x}/" for x in range(2, 50)]]
allin = []
for future in futures:
allin.extend(future.result())
def parser(url):
with requests.Session() as req:
print(f"Extracting {url}")
r = req.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
target = [item.get_text(strip=True, separator=" ") for item in soup.find(
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
head = [soup.find("h1", class_="plugin-title").text]
new = [x for x in target if x.startswith(
("V", "Las", "Ac", "W", "T", "P"))]
return head + new
with ThreadPoolExecutor(max_workers=50) as executor1:
futures1 = [executor1.submit(parser, url) for url in allin]
for future in futures1:
print(future.result())
that runs like so – but gives back some errors..(see below)
Extracting https://wordpress.org/plugins/use-google-libraries/
Extracting https://wordpress.org/plugins/blocksy-companion/
Extracting https://wordpress.org/plugins/cherry-sidebars/
Extracting https://wordpress.org/plugins/accesspress-social-share/Extracting https://wordpress.org/plugins/goodbye-captcha/
Extracting https://wordpress.org/plugins/wp-whatsapp/
here the traceback of the errors:
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Traceback (most recent call last):
File "C:Usersrob.spyder-py3devuntitled0.py", line 51, in <module>
print(future.result())
File "C:UsersrobdevelIDElibconcurrentfutures_base.py", line 432, in result
return self.__get_result()
File "C:UsersrobdevelIDElibconcurrentfutures_base.py", line 388, in __get_result
raise self._exception
File "C:UsersrobdevelIDElibconcurrentfuturesthread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "C:Usersrob.spyder-py3devuntitled0.py", line 39, in parser
target = [item.get_text(strip=True, separator=" ") for item in soup.find(
AttributeError: 'NoneType' object has no attribute 'find_next'
Update: as mentioned above i am getting this AttributeError which says that NoneType has no attribute find_next. Below is the line that’s giving the nasty problems.
target = [item.get_text(strip=True, separator=" ") for item in soup.find("h3", class_="screen-reader-text").find_next("ul").findAll("li")]
Specifically, the issue is in the soup.find() method, which can return either a Tag (when it finds something), which has a .find_next() method (i.e. attribute) or None (when it doesn’t find anything), which doesn’t. We can try extracting this whole call to its own variable, which we can then test.
tag = soup.find("h3", class_="screen-reader-text")
target = []
if tag:
lis = tag.find_next("ul").findAll("li")
target = [item.get_text(strip=True, separator=" ") for item in lis[:8]]
btw; we can use CSS selectors instead to get this running:
target = [item.get_text(strip=True, separator=" ") for item in soup.select("h3.screen-reader-text + ul li")[:8]]
This gets "all li anywhere under ul that’s right next to h3 with the screen-reader-text class". If we want li directly under ul (which they would usually be anyway, but that’s not always the case for other elements), we could use ul > li instead (the > means "direct child").
note: the best thing would be to dump all the results into a csv-file or – to print it out on screen.
look forward to hear from you
2
Answers
The page is rather well organized so scraping it should be pretty straight forward. All you need to do is get the plugin card and then simply extract the necessary parts.
Here’s my take on it.
It scrapes all the pages (currently 49 of them) and dumps everything to a
.csv
file with980
rows (as of now) that looks like this:You don’t even have to run the code, the entire dump is here.
Baduker’s solution is great, but just wanted to add.
We could slightly modify the parsing of the plugin card as there is an api that retuns all that data. Would still require small amount of processing (Ie. Pull out the content for author, the rating is stored out of 100 I believe (so a rating of 82 is really 82/100*5 = 4.1 -> "4 Stars"), and things like that.
But thought I would share.
Here’s just a sample showing you the columns: