Confused: I'm trying to build a scraper for UK news in Python.
```python
import feedparser
import pandas as pd

def poll_rss(rss_url):
    feed = feedparser.parse(rss_url)
    for entry in feed.entries:
        print("Title:", entry.title)
        print("Description:", entry.description)
        print("\n")
# Example usage:
feeds = [
    {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
    {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
    {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
    {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
    {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
    {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
    {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
    {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
    {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
    {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
{"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
{"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
{"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
{"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
{"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}]
for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    data = []
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        if entry.summary:
            summary = entry.summary
            data.append(summary)
        else:
            entry.summary = "No summary available"
        if entry.published:
            date = entry.published
            data.append (data)
        else:
            data.append("No data available")
```
I then have a bit of code to sort out the saving.
```python
df = pd.DataFrame(data)
df.columns = ['title', 'url', 'summary', 'date']
print("data" + df)

from sqlalchemy import create_engine
import mysql.connector

engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
df.to_sql('nationals', con=engine, if_exists='append', index=False)
```
Although the `nationals` table has been created and the credentials are right, why does it not save?
2 Answers
If the credentials are correct as you say, then the `to_sql` call is fine. I think the problem is the Python loop that parses the feeds. In particular, the line `data.append (data)` creates a recursive list that cannot be constructed into a DataFrame. Also, I think the `data` list should be a nested list where each sub-list is one entry of a `parsed_feed` (so that each row in the DataFrame is one entry). I would write the loop along these lines, a minimal sketch that builds one sub-list per entry (feedparser entries are dict-like, so `.get()` supplies a default when a feed omits a field):
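```python
data = []
for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    for entry in parsed_feed.entries:
        # .get() returns a placeholder instead of raising when a field is missing
        title = entry.get('title', 'No title available')
        url = entry.get('link', 'No link available')
        summary = entry.get('summary', 'No summary available')
        date = entry.get('published', 'No date available')
        # one sub-list per entry -> one row per entry in the DataFrame
        data.append([title, url, summary, date])

df = pd.DataFrame(data, columns=['title', 'url', 'summary', 'date'])
# then save exactly as you already do
df.to_sql('nationals', con=engine, if_exists='append', index=False)
```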
I checked it with the feed list you provided and it works fine.
Since RSS feeds are XML files, consider `pandas.read_xml` and bind the data via a list comprehension, which avoids the bookkeeping of initializing a list and appending elements. Additionally, process each feed via a user-defined method, and since you are scraping web links that can change, incorporate `try...except`, which flags three problematic URLs in your post. A sketch of that approach (assuming pandas 1.3+ with lxml for `read_xml`; RSS `<item>` fields vary by feed, so the resulting columns may differ):