
I'm trying to build a scraper of UK news in Python and I'm confused.

import feedparser
import pandas as pd

def poll_rss(rss_url):
    feed = feedparser.parse(rss_url)
    for entry in feed.entries:
        print("Title:", entry.title)
        print("Description:", entry.description)
        print("\n")

# Example usage:
feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
        {"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}]

for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    data = []
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        if entry.summary:
            summary = entry.summary
            data.append(summary)
        else:
            entry.summary = "No summary available"
        if entry.published:
            date = entry.published
            data.append (data)
        else:
            data.append("No data available")

I then have a bit of code to sort out the saving.

df = pd.DataFrame(data)
df.columns = ['title', 'url', 'summary', 'date']
print("data" + df)
from sqlalchemy import create_engine
import mysql.connector
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')  
df.to_sql('nationals', con = engine, if_exists = 'append', index = False)

Although the nationals table has been created and the credentials are right, why does it not save?

2 Answers


  1. If the credentials are correct, as you say, then the to_sql call is fine. I think the problem is the Python loop that parses the feeds. In particular, the line data.append (data) creates a recursive list that cannot be turned into a dataframe. Also, the data list should be a nested list where each sub-list is one entry from a parsed_feed (so that each row in the dataframe is one entry).
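    To see why that line is fatal, here is a minimal sketch (the string value is illustrative):

```python
# Repro of the bug: data.append(data) makes the list contain itself,
# so it can never become a rectangular table of rows.
data = []
data.append("some summary text")   # a plain string element
data.append(data)                  # the list now contains itself
print(data[1] is data)             # True: a self-referential list
```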

    I would write the loop as

    data = []                               # <---- initialize empty list here
    for feed in feeds:    
        parsed_feed = feedparser.parse(feed['url'])
        print("Title:", feed['title'])
        print("Number of Articles:", len(parsed_feed.entries))
        print("\n")
        for entry in parsed_feed.entries:
            title = entry.title
            url = entry.link
            print(entry.summary)
            summary = entry.summary or "No summary available" # "or" replaces the if/else here
            date = entry.published or "No data available"     # "or" replaces the if/else here
            data.append([title, url, summary, date])          # <---- append data from each entry here
    
    df = pd.DataFrame(data, columns = ['title', 'url', 'summary', 'date'])
    from sqlalchemy import create_engine
    import mysql.connector
    engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')  
    df.to_sql('nationals', con = engine, if_exists = 'append', index = False)
    

    I checked it with the feed list you provided and it works fine.
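    If you want to rule out the to_sql call itself, you can round-trip a tiny frame through an in-memory SQLite engine instead of MySQL. This is only a sanity check of the saving path, with made-up row values:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for MySQL just to exercise the to_sql path.
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame(
    [["a title", "http://example.com", "a summary", "2023-02-19"]],
    columns=["title", "url", "summary", "date"],
)
df.to_sql("nationals", con=engine, if_exists="append", index=False)

# Read the rows back to confirm the write happened.
out = pd.read_sql("SELECT * FROM nationals", con=engine)
print(len(out))  # 1
```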

  2. Since RSS feeds are XML files, consider pandas.read_xml and bind the data via a list comprehension, which avoids the bookkeeping of initializing a list and appending elements.

    Additionally, process each feed via a user-defined method, and since you are scraping web links that can change, wrap each request in try...except; doing so reveals three problematic URLs in your post.

    import pandas as pd
    
    feeds = [
        {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss"},          # FIXED URL: REMOVE .xml
        {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},         # PROBLEM URL
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},         # PROBLEM URL
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},        # PROBLEM URL
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}          
    ]
    
    hdr = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'none',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'
    }
    
    def proc_rss(feed):
        rss_df = None
        print("Title:", feed['title'])
    
        try:
            # PARSE RSS XML W/ HEADERS, KEEP SPECIFIC COLUMNS, RENAME COLUMNS
            rss_df = (
                pd.read_xml(feed["url"], xpath=".//item", storage_options=hdr)
                  .reindex(["title", "link", "description", "pubDate"], axis="columns")
                  .set_axis(["title", "url", "summary", "date"], axis="columns")
            )
    
            print("Number of Articles:", rss_df.shape[0])
    
        except Exception as e:
            print("Number of Articles: NONE. Reason:", e)
    
        print("")
        return rss_df 
    
    # BIND THE LIST COMPREHENSION RESULTS INTO A SINGLE DATA FRAME
    rss_df = pd.concat([proc_rss(f) for f in feeds], ignore_index=True)
    
    print(rss_df)
    

    Output

    Title: BBC
    Number of Articles: 34
    
    Title: The Economist
    Number of Articles: 100
    
    Title: The New Statesman
    Number of Articles: 20
    
    Title: The New York Times
    Number of Articles: 27
    
    Title: Metro UK
    Number of Articles: 30
    
    Title: Evening Standard
    Number of Articles: 100
    
    Title: Daily Mail
    Number of Articles: 153
    
    Title: Sky News
    Number of Articles: NONE. Reason: HTTP Error 404: Not Found
    
    Title: The Mirror
    Number of Articles: 25
    
    Title: The Sun
    Number of Articles: 100
    
    Title: Sky News
    Number of Articles: NONE. Reason: HTTP Error 404: Not Found
    
    Title: The Guardian
    Number of Articles: 113
    
    Title: The Independent
    Number of Articles: 100
    
    Title: The Telegraph
    Number of Articles: 100
    
    Title: The Times
    Number of Articles: NONE. Reason: xmlParseEntityRef: no name, line 1, column 1556 (<string>, line 1)
    
    Title: The Mirror
    Number of Articles: 25
    
                                                     title                                                url                                            summary                             date
    0    Nicola Bulley: Lancashire Police find body in ...  https://www.bbc.co.uk/news/uk-england-64697300...  Officers searching for the missing mother-of t...    Sun, 19 Feb 2023 17:54:18 GMT
    1    Baftas 2023: All Quiet on the Western Front do...  https://www.bbc.co.uk/news/entertainment-arts-...  Netflix's World War One epic won best film and...    Sun, 19 Feb 2023 23:12:05 GMT
    2    Dickie Davies, host of ITV's World of Sport fo...  https://www.bbc.co.uk/news/uk-england-lancashi...  The presenter anchored the five-hour live TV m...    Mon, 20 Feb 2023 00:47:00 GMT
    3    Son Heung-min: Tottenham condemn 'utterly repr...  https://www.bbc.co.uk/sport/football/64700428?...  Tottenham call for social media companies to t...    Sun, 19 Feb 2023 22:25:04 GMT
    4    Argentina Open: British number one Cameron Nor...  https://www.bbc.co.uk/sport/tennis/64700048?at...  British number one Cameron Norrie misses out o...    Sun, 19 Feb 2023 21:45:24 GMT
    ..                                                 ...                                                ...                                                ...                              ...
    922  Nicola Bulley's family 'incredibly heartbroken...  https://www.mirror.co.uk/news/uk-news/breaking...  Lancashire Police has recovered a body around ...  Sun, 19 Feb 2023 19:51:09 +0000
    923  Shamed Matt Hancock gets 'worked like a barbec...  https://www.mirror.co.uk/tv/tv-news/shamed-mat...  SAS: Who Dares Wins star Rudy Reyessays shamed...  Sun, 19 Feb 2023 19:35:03 +0000
    924  Treasure hunter uses map left by his father to...  https://www.mirror.co.uk/news/world-news/treas...  Jan Glazewski dug up the silver treasure burie...  Sun, 19 Feb 2023 19:19:15 +0000
    925  'My husband refuses to be in the delivery room...  https://www.mirror.co.uk/news/weird-news/my-hu...  A first-time mum-to-be says she's now feeling ...  Sun, 19 Feb 2023 19:17:34 +0000
    926  Nicola Bulley search diver sends message of su...  https://www.mirror.co.uk/news/uk-news/nicola-b...  The expert search diver called in to assist wi...  Sun, 19 Feb 2023 19:16:13 +0000
    
    [927 rows x 4 columns]
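    One detail worth knowing: proc_rss returns None for a failed feed, and pd.concat silently drops None entries (it raises only when every entry is None), which is why the failed Sky News and Times feeds simply contribute no rows. A minimal sketch with a made-up frame:

```python
import pandas as pd

# A successful feed yields a DataFrame; a failed one yields None.
ok = pd.DataFrame({"title": ["a"], "url": ["u"]})
failed = None

# pd.concat drops the None entry silently, keeping the good rows.
combined = pd.concat([ok, failed, ok], ignore_index=True)
print(len(combined))  # 2
```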
    