
The project: collecting metadata for WordPress plugins. Approximately 50 plugins are of interest, but the challenge is that I want to fetch the metadata of all the existing plugins first. After the fetch, I want to filter out the plugins with the newest timestamp, i.e. those that were updated most recently. It is all about actuality… so the base URL to start from is this:

url = "https://wordpress.org/plugins/browse/popular/

The aim: I want to fetch all the metadata of the plugins found on the first 50 pages of the popular plugins, for example:

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms (Ninja Forms Contact Form – The Drag and Drop Form Builder for WordPress)
https://wordpress.org/plugins/participants-database … and so on and so forth.

here we go:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    # collect the plugin page links (the rel="bookmark" anchors) from one listing page
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        links = [item.get("href")
                 for item in soup.findAll("a", rel="bookmark")]
        return set(links)


# page 1 has no suffix; pages 2-49 use the "page/<x>/" suffix
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # fragile: soup.find() returns None when no matching <h3> exists
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        # keep only the Version / Last updated / Active installations /
        # WordPress version / Tested up to / PHP version entries
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())

That runs like so, but gives back some errors (see below).

Extracting https://wordpress.org/plugins/use-google-libraries/
Extracting https://wordpress.org/plugins/blocksy-companion/
Extracting https://wordpress.org/plugins/cherry-sidebars/
Extracting https://wordpress.org/plugins/accesspress-social-share/
Extracting https://wordpress.org/plugins/goodbye-captcha/
Extracting https://wordpress.org/plugins/wp-whatsapp/

Here is the traceback of the errors:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Traceback (most recent call last):

  File "C:\Users\rob\.spyder-py3\dev\untitled0.py", line 51, in <module>
    print(future.result())

  File "C:\Users\rob\develIDE\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()

  File "C:\Users\rob\develIDE\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception

  File "C:\Users\rob\develIDE\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)

  File "C:\Users\rob\.spyder-py3\dev\untitled0.py", line 39, in parser
    target = [item.get_text(strip=True, separator=" ") for item in soup.find(

AttributeError: 'NoneType' object has no attribute 'find_next'

Update: as mentioned above, I am getting this AttributeError, which says that NoneType has no attribute find_next. Below is the line that's causing the problem:

target = [item.get_text(strip=True, separator=" ") for item in soup.find("h3", class_="screen-reader-text").find_next("ul").findAll("li")]

Specifically, the issue is with soup.find(), which returns either a Tag (when it finds something), which has a .find_next() method, or None (when it doesn't find anything), which doesn't. We can extract this call into its own variable, which we can then test:

tag = soup.find("h3", class_="screen-reader-text")
target = []
if tag:
    lis = tag.find_next("ul").findAll("li")
    target = [item.get_text(strip=True, separator=" ") for item in lis[:8]]

By the way, we can use CSS selectors instead to get this working:

target = [item.get_text(strip=True, separator=" ") for item in soup.select("h3.screen-reader-text + ul li")[:8]]

This gets "all li anywhere under ul that’s right next to h3 with the screen-reader-text class". If we want li directly under ul (which they would usually be anyway, but that’s not always the case for other elements), we could use ul > li instead (the > means "direct child").

Note: the best thing would be to dump all the results into a CSV file, or to print them out on screen.
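
For the CSV dump, here is a minimal sketch on top of the futures1 results from above; each result is a plain list of strings, so the rows are written as-is:

import csv

# dump every parsed plugin row into a CSV file
with open("plugins.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for future in futures1:
        writer.writerow(future.result())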

Looking forward to hearing from you!

2 Answers


  1. The page is rather well organized, so scraping it should be pretty straightforward. All you need to do is get the plugin card and then simply extract the necessary parts.

    Here’s my take on it.

    import time
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    main_url = "https://wordpress.org/plugins/browse/popular"
    headers = [
        "Title", "Rating", "Rating Count", "Excerpt", "URL",
        "Author", "Active installs", "Tested with", "Last Updated",
    ]
    
    
    def wait_a_bit(wait_for: float = 1.5):
        time.sleep(wait_for)
    
    
    def parse_plugin_card(card) -> list:
        title = card.select_one("h3").getText()
        rating = card.select_one(
            ".plugin-rating .wporg-ratings"
        )["data-rating"]
        rating_count = card.select_one(
            ".plugin-rating .rating-count a"
        ).getText().replace(" total ratings", "")
        excerpt = card.select_one(
            ".plugin-card .entry-excerpt p"
        ).getText()
        plugin_author = card.select_one(
            ".plugin-card footer span.plugin-author"
        ).getText(strip=True)
        active_installs = card.select_one(
            ".plugin-card footer span.active-installs"
        ).getText(strip=True)
        tested_with = card.select_one(
            ".plugin-card footer span.tested-with"
        ).getText(strip=True)
        last_updated = card.select_one(
            ".plugin-card footer span.last-updated"
        ).getText(strip=True)
        plugin_url = card.select_one(
            ".plugin-card .entry-title a"
        )["href"]
        return [
            title, rating, rating_count, excerpt, plugin_url,
            plugin_author, active_installs, tested_with, last_updated,
        ]
    
    
    with requests.Session() as connection:
        # the last page number is the second-to-last .page-numbers entry
        pages = (
            BeautifulSoup(
                connection.get(main_url).text,
                "lxml",
            ).select(".pagination .nav-links .page-numbers")
        )[-2].getText(strip=True)
    
        all_cards = []
        for page in range(1, int(pages) + 1):
            print(f"Scraping page {page} out of {pages}...")
            # deal with the first page
            page_link = f"{main_url}" if page == 1 else f"{main_url}/page/{page}"
            plugin_cards = BeautifulSoup(
                connection.get(page_link).text,
                "lxml",
            ).select(".plugin-card")
            for plugin_card in plugin_cards:
                all_cards.append(parse_plugin_card(plugin_card))
            # be polite: pause between page requests
            wait_a_bit()
    
    df = pd.DataFrame(all_cards, columns=headers)
    df.to_csv("all_plugins.csv", index=False)
    

    It scrapes all the pages (currently 49 of them) and dumps everything to a .csv file with 980 rows (as of now) that looks like this:

    [screenshot of the resulting all_plugins.csv]

    You don’t even have to run the code; the entire dump is here.

  2. Baduker’s solution is great, but I just wanted to add to it.

    We could slightly modify the parsing of the plugin card, as there is an API that returns all of that data. It would still require a small amount of processing, e.g. pulling the content out of the author field, or converting the rating, which is stored out of 100, I believe (so a rating of 82 is really 82 / 100 * 5 = 4.1 -> "4 stars"), and things like that; see the post-processing sketch after the sample below.

    But thought I would share.

    import time
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    main_url = "https://wordpress.org/plugins/browse/popular"
    
    
    def wait_a_bit(wait_for: float = 1.5):
        time.sleep(wait_for)
    
    
    # MODIFICATION MADE HERE
    def parse_plugin_card(card):
        plugin_slug = card.select_one("a")["href"].split("/")[-2]
        url = f"https://api.wordpress.org/plugins/info/1.0/{plugin_slug}.json"
        jsonData = requests.get(url).json()
        # the "sections" values are HTML fragments, so strip the markup
        sections = jsonData.pop("sections")
        for k, v in sections.items():
            sections[k] = BeautifulSoup(v, "lxml").text
        jsonData.update(sections)
        return jsonData
    
    
    with requests.Session() as connection:
        pages = (
            BeautifulSoup(
                connection.get(main_url).text,
                "lxml",
            ).select(".pagination .nav-links .page-numbers")
        )[-2].getText(strip=True)
    
        all_cards = []
        for page in range(1, int(pages) + 1):
            print(f"Scraping page {page} out of {pages}...")
            # deal with the first page
            page_link = f"{main_url}" if page == 1 else f"{main_url}/page/{page}"
            plugin_cards = BeautifulSoup(
                connection.get(page_link).text,
                "lxml",
            ).select(".plugin-card")
            for plugin_card in plugin_cards:
                all_cards.append(parse_plugin_card(plugin_card))
            # be polite: pause between page requests
            wait_a_bit()
    
    df = pd.DataFrame(all_cards)
    df.to_csv("all_plugins.csv", index=False)
    

    Here’s just a sample showing you the columns:

    [screenshot of a sample of the columns]
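
    Since the original goal was to surface the most recently updated plugins, here is a minimal post-processing sketch on top of the dumped CSV. The last_updated format (e.g. "2021-04-06 1:14pm GMT"), the out-of-100 rating, and the name column are assumptions about the API payload; verify them against your actual dump.

    import pandas as pd

    df = pd.read_csv("all_plugins.csv")

    # assumption: "last_updated" looks like "2021-04-06 1:14pm GMT";
    # errors="coerce" turns anything unparseable into NaT instead of raising
    df["last_updated"] = pd.to_datetime(
        df["last_updated"].str.replace(" GMT", "", regex=False),
        errors="coerce",
    )

    # assumption: "rating" is stored out of 100, so scale it to 5 stars
    df["stars"] = (df["rating"] / 100 * 5).round(1)

    # newest first -- the "actuality" filter from the original question
    newest = df.sort_values("last_updated", ascending=False)
    print(newest[["name", "stars", "last_updated"]].head(10))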
