
I'm trying to build a scraper of UK news in Python and I'm confused.

import feedparser
import pandas as pd

def poll_rss(rss_url):
    feed = feedparser.parse(rss_url)
    for entry in feed.entries:
        print("Title:", entry.title)
        print("Description:", entry.description)
        print("\n")

# Example usage:
feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
        {"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}]

for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    data = []
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        if entry.summary:
            summary = entry.summary
            data.append(summary)
        else:
            entry.summary = "No summary available"
        if entry.published:
            date = entry.published
            data.append (data)
        else:
            data.append("No data available")

I then have a bit of code to sort out the saving.

df = pd.DataFrame(data)
df.columns = ['title', 'url', 'summary', 'date']
print("data" + df)
from sqlalchemy import create_engine
import mysql.connector
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')  
df.to_sql('nationals', con = engine, if_exists = 'append', index = False)

Although the nationals table has been created and the credentials are right, why does it not save?

2 Answers


  1. If the credentials are correct, as you say, then the to_sql call is fine. I think the problem is the Python loop that parses the feeds. In particular, the line data.append (data) creates a recursive list that cannot be turned into a dataframe. Also, the data list should be a nested list where each sub-list is one entry from a parsed_feed (so that each row in the dataframe is one entry).
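    To see why that line is fatal, here is a minimal sketch (the string value is illustrative):

```python
# Repro of the bug: data.append(data) makes the list contain itself,
# so it can never become a rectangular table of rows.
data = []
data.append("some summary text")   # a plain string element
data.append(data)                  # the list now contains itself
print(data[1] is data)             # True: a self-referential list
```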

    I would write the loop as

    data = []                               # <---- initialize empty list here
    for feed in feeds:    
        parsed_feed = feedparser.parse(feed['url'])
        print("Title:", feed['title'])
        print("Number of Articles:", len(parsed_feed.entries))
        print("\n")
        for entry in parsed_feed.entries:
            title = entry.title
            url = entry.link
            print(entry.summary)
            summary = entry.summary or "No summary available" # "or" replaces the if/else here
            date = entry.published or "No data available"     # "or" replaces the if/else here
            data.append([title, url, summary, date])          # <---- append data from each entry here
    
    df = pd.DataFrame(data, columns = ['title', 'url', 'summary', 'date'])
    from sqlalchemy import create_engine
    import mysql.connector
    engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')  
    df.to_sql('nationals', con = engine, if_exists = 'append', index = False)
    

    I checked it with the feed list you provided and it works fine.
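    If you want to rule out the to_sql call itself, you can round-trip a tiny frame through an in-memory SQLite engine instead of MySQL. This is only a sanity check of the saving path, with made-up row values:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for MySQL just to exercise the to_sql path.
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame(
    [["a title", "http://example.com", "a summary", "2023-02-19"]],
    columns=["title", "url", "summary", "date"],
)
df.to_sql("nationals", con=engine, if_exists="append", index=False)

# Read the rows back to confirm the write happened.
out = pd.read_sql("SELECT * FROM nationals", con=engine)
print(len(out))  # 1
```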

  2. Since RSS feeds are XML files, consider pandas.read_xml and bind the data via a list comprehension, which avoids the bookkeeping of initializing a list and appending elements.

    Additionally, process each feed via a user-defined method, and since you are scraping web links that can change, wrap each request in try...except; doing so reveals three problematic URLs in your post.

    import pandas as pd
    
    feeds = [
        {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss"},          # FIXED URL: REMOVE .xml
        {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},         # PROBLEM URL
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},         # PROBLEM URL
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},        # PROBLEM URL
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}          
    ]
    
    hdr = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'none',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'
    }
    
    def proc_rss(feed):
        rss_df = None
        print("Title:", feed['title'])
    
        try:
            # PARSE RSS XML W/ HEADERS, KEEP SPECIFIC COLUMNS, RENAME COLUMNS
            rss_df = (
                pd.read_xml(feed["url"], xpath=".//item", storage_options=hdr)
                  .reindex(["title", "link", "description", "pubDate"], axis="columns")
                  .set_axis(["title", "url", "summary", "date"], axis="columns")
            )
    
            print("Number of Articles:", rss_df.shape[0])
    
        except Exception as e:
            print("Number of Articles: NONE. Reason:", e)
    
        print("")
        return rss_df 
    
    # BIND THE LIST COMPREHENSION RESULTS INTO A SINGLE DATA FRAME
    rss_df = pd.concat([proc_rss(f) for f in feeds], ignore_index=True)
    
    print(rss_df)
    

    Output

    Title: BBC
    Number of Articles: 34
    
    Title: The Economist
    Number of Articles: 100
    
    Title: The New Statesman
    Number of Articles: 20
    
    Title: The New York Times
    Number of Articles: 27
    
    Title: Metro UK
    Number of Articles: 30
    
    Title: Evening Standard
    Number of Articles: 100
    
    Title: Daily Mail
    Number of Articles: 153
    
    Title: Sky News
    Number of Articles: NONE. Reason: HTTP Error 404: Not Found
    
    Title: The Mirror
    Number of Articles: 25
    
    Title: The Sun
    Number of Articles: 100
    
    Title: Sky News
    Number of Articles: NONE. Reason: HTTP Error 404: Not Found
    
    Title: The Guardian
    Number of Articles: 113
    
    Title: The Independent
    Number of Articles: 100
    
    Title: The Telegraph
    Number of Articles: 100
    
    Title: The Times
    Number of Articles: NONE. Reason: xmlParseEntityRef: no name, line 1, column 1556 (<string>, line 1)
    
    Title: The Mirror
    Number of Articles: 25
    
                                                     title                                                url                                            summary                             date
    0    Nicola Bulley: Lancashire Police find body in ...  https://www.bbc.co.uk/news/uk-england-64697300...  Officers searching for the missing mother-of t...    Sun, 19 Feb 2023 17:54:18 GMT
    1    Baftas 2023: All Quiet on the Western Front do...  https://www.bbc.co.uk/news/entertainment-arts-...  Netflix's World War One epic won best film and...    Sun, 19 Feb 2023 23:12:05 GMT
    2    Dickie Davies, host of ITV's World of Sport fo...  https://www.bbc.co.uk/news/uk-england-lancashi...  The presenter anchored the five-hour live TV m...    Mon, 20 Feb 2023 00:47:00 GMT
    3    Son Heung-min: Tottenham condemn 'utterly repr...  https://www.bbc.co.uk/sport/football/64700428?...  Tottenham call for social media companies to t...    Sun, 19 Feb 2023 22:25:04 GMT
    4    Argentina Open: British number one Cameron Nor...  https://www.bbc.co.uk/sport/tennis/64700048?at...  British number one Cameron Norrie misses out o...    Sun, 19 Feb 2023 21:45:24 GMT
    ..                                                 ...                                                ...                                                ...                              ...
    922  Nicola Bulley's family 'incredibly heartbroken...  https://www.mirror.co.uk/news/uk-news/breaking...  Lancashire Police has recovered a body around ...  Sun, 19 Feb 2023 19:51:09 +0000
    923  Shamed Matt Hancock gets 'worked like a barbec...  https://www.mirror.co.uk/tv/tv-news/shamed-mat...  SAS: Who Dares Wins star Rudy Reyessays shamed...  Sun, 19 Feb 2023 19:35:03 +0000
    924  Treasure hunter uses map left by his father to...  https://www.mirror.co.uk/news/world-news/treas...  Jan Glazewski dug up the silver treasure burie...  Sun, 19 Feb 2023 19:19:15 +0000
    925  'My husband refuses to be in the delivery room...  https://www.mirror.co.uk/news/weird-news/my-hu...  A first-time mum-to-be says she's now feeling ...  Sun, 19 Feb 2023 19:17:34 +0000
    926  Nicola Bulley search diver sends message of su...  https://www.mirror.co.uk/news/uk-news/nicola-b...  The expert search diver called in to assist wi...  Sun, 19 Feb 2023 19:16:13 +0000
    
    [927 rows x 4 columns]
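    One detail worth knowing: proc_rss returns None for a failed feed, and pd.concat silently drops None entries (it raises only when every entry is None), which is why the failed Sky News and Times feeds simply contribute no rows. A minimal sketch with a made-up frame:

```python
import pandas as pd

# A successful feed yields a DataFrame; a failed one yields None.
ok = pd.DataFrame({"title": ["a"], "url": ["u"]})
failed = None

# pd.concat drops the None entry silently, keeping the good rows.
combined = pd.concat([ok, failed, ok], ignore_index=True)
print(len(combined))  # 2
```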
    