
I am trying to use SitemapSpider to parse a sitemap. Please see the following code. How can I get additional information from the sitemap in the parse function? For example, the sitemap already contains news:keywords and news:stock_tickers. How do I get that data and pass it to the parse function?

from scrapy.spiders import SitemapSpider
class ReutersSpider(SitemapSpider):
    name = 'reuters'
    sitemap_urls = ['https://www.reuters.com/sitemap_news_index1.xml']

    def parse(self, response):
        # How can I get data like news:stock_tickers from the sitemap for this item?
        # I only have the URL from the sitemap here.
        yield {
            'title': response.css("title ::text").extract_first(),
            'url': response.url
        }

Sitemap item example

<url>
<loc>
https://www.reuters.com/article/micron-tech-results/update-6-micron-sales-profit-miss-estimates-as-chip-glut-hurts-prices-idUSL3N1YN50N
</loc>
<news:news>
<news:publication>
<news:name>Reuters</news:name>
<news:language>eng</news:language>
</news:publication>
<news:publication_date>2018-12-19T03:50:10+00:00</news:publication_date>
<news:title>
UPDATE 6-Micron sales, profit miss estimates as chip glut hurts prices
</news:title>
<news:keywords>Headlines,Industrial Conglomerates</news:keywords>
<news:stock_tickers>
SEO:000660,SEO:005930,TYO:6502,NASDAQ:AAPL,NASDAQ:AMZN
</news:stock_tickers>
</news:news>
</url>

2 Answers


  1. SitemapSpider is specialized for extracting links and nothing else, so it doesn’t provide the means for extracting additional data from a sitemap.

    You could override its _parse_sitemap method to pass the extra data along in the generated requests’ meta.
    However, if your sitemap is simple enough, it might be simpler to just do your own sitemap parsing.

  2. As @stranac pointed out, Scrapy (and its bundled spiders) is built to extract information from web pages; sitemaps are a good way to discover links to the pages on a site, but Scrapy isn’t really designed for crawling data directly from the sitemaps themselves.

    So, as suggested, you need to create your own spider, which could look like this:

    from scrapy import Spider, Request
    from lxml import etree
    
    
    class MySpider(Spider):
        name = 'sitemap_example'
    
        def start_requests(self):
            yield Request('https://www.reuters.com/sitemap_news_index1.xml')
    
        def parse(self, response):
            sitemap = etree.fromstring(response.body)
            for child in sitemap.getchildren():
                # Find the <news:news> element inside each <url> entry.
                inner_children = child.getchildren()
                news_child = [x for x in inner_children if 'news' in x.tag]
                if not news_child:
                    continue
                news_child = news_child[0]
                stock_child = [x for x in news_child if 'stock_tickers' in x.tag]
                keywords_child = [x for x in news_child if 'keywords' in x.tag]
                title_child = [x for x in news_child if 'title' in x.tag]
                # Only yield when all three fields are present, to avoid IndexErrors.
                if stock_child and keywords_child and title_child:
                    yield {
                        'stock_tickers': stock_child[0].text,
                        'keywords': keywords_child[0].text,
                        'title': title_child[0].text,
                    }
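    The listing above matches tags by substring, which is fragile if tag names overlap. A variant using explicit namespace URIs (the standard sitemap and Google news-sitemap namespaces, inferred from the snippet in the question) can be written with the standard library alone:

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the standard sitemap and Google news-sitemap schemas.
NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}


def parse_news_sitemap(xml_bytes):
    """Yield one dict per <url> entry that carries <news:news> data."""
    root = ET.fromstring(xml_bytes)
    for url in root.findall('sm:url', NS):
        news = url.find('news:news', NS)
        if news is None:
            continue
        yield {
            'loc': url.findtext('sm:loc', default='', namespaces=NS).strip(),
            'title': news.findtext('news:title', default='', namespaces=NS).strip(),
            'keywords': news.findtext('news:keywords', default='', namespaces=NS).strip(),
            'stock_tickers': news.findtext('news:stock_tickers', default='', namespaces=NS).strip(),
        }
```

    Inside a Scrapy callback you would call this generator on response.body and yield its dicts directly, or use the loc values to build follow-up Requests.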
    

    Please let me know what you think and whether I can help you with anything else.
