
I’m trying to scrape the blog “https://blog.feedspot.com/ai_rss_feeds/” and crawl through all the links in it, looking for Artificial Intelligence related information on each of the crawled links.

The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called “Site” in the UI. I need to get all the links in the “Site” attribute. Example: aitrends.com, sciencedaily.com/… etc. In the code, the main div has a class called “rss-block”, which has a nested class called “data”; each “data” contains several `<a>` tags, and each `<a>` tag has an `href` attribute. The value in `href` gives the links to be crawled. We need to look for AI related articles on each of the links found by scraping the above-mentioned structure.

I’ve tried various variations of the following code but nothing seemed to help much.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')

class_name='data'

dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)

I’m struggling to even get the links on that page, let alone crawling through each of those links to scrape AI-related articles from them.

If you could help me finish both parts of the problem, that would be a great learning experience for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML structure. Thanks in advance!

2 Answers


  1. The first twenty results are stored in the HTML as you see on the page. The others are pulled from a script tag, and you can regex them out to create the full list of 67. Then loop over that list and issue requests to those links for further info. I offer a choice of two different selectors for the initial list population (the second, commented out, uses :contains – available with bs4 4.7.1+).

    from bs4 import BeautifulSoup as bs
    import requests, re
    
    p = re.compile(r'feed_domain":"(.*?)",')
    
    with requests.Session() as s:
        r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
        soup = bs(r.content, 'lxml')
        results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
        ## or use with bs4 4.7.1 + 
        #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
        # domains in the script tag come escaped (\/) with embedded newlines/whitespace
        results += [re.sub(r'\n\s+', '', i.replace('\\/', '/')) for i in p.findall(r.text)]
    
        for link in results:
            #do something e.g.
            r = s.get(link)
            soup = bs(r.content, 'lxml')
            # extract info from indiv page
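
The `# extract info from indiv page` step is left open above. One crude way to flag AI-related content is a keyword scan over each page’s text. A minimal stdlib sketch – the keyword list and the tag-stripping regex are my assumptions, not part of the answer (in practice you would parse each page with BeautifulSoup, as above):

```python
import re

# Hypothetical keyword list -- tune to taste.
AI_KEYWORDS = ('artificial intelligence', 'machine learning',
               'neural network', 'deep learning')

def ai_paragraphs(html):
    """Strip tags crudely and keep text chunks that mention an AI keyword."""
    text = re.sub(r'<[^>]+>', '  ', html)  # replace tags with whitespace separators
    chunks = [c.strip() for c in re.split(r'\s{2,}|\n', text) if c.strip()]
    return [c for c in chunks if any(k in c.lower() for k in AI_KEYWORDS)]
```

Inside the loop you would call `ai_paragraphs(r.text)` and keep the link whenever the result is non-empty.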
    
  2. To get all the sublinks for each block, you can use soup.find_all:

    from bs4 import BeautifulSoup as soup
    import requests
    d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
    results = [
        [a['href'] for a in block.find('div', {'class': 'data'}).find_all('a')]
        for block in d.find_all('div', {'class': 'rss-block'})
    ]
    

    Output:

    [['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/@Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/@Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 
'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
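
Judging by the output above, each sublist ends with the “Site” URL the question is after (feed URL, Feedspot follow link, site link, in that order). A small helper to keep only the site links before crawling – a sketch assuming that ordering holds for every block:

```python
def site_links(results):
    """Take the last href of each block, dropping any stray feedspot.com follow links."""
    return [links[-1] for links in results
            if links and 'feedspot.com' not in links[-1]]
```

On the output above this would start with `['http://aitrends.com/', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/', ...]`, which can then be fed into whatever per-page AI-article check you choose.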
    