I am trying to use SitemapSpider to parse a sitemap. Please see the code below. How can I get additional information from the sitemap inside the parse function? For example, the sitemap already contains news:keywords and news:stock_tickers. How do I get that data and pass it to the parse function?
from scrapy.spiders import SitemapSpider

class ReutersSpider(SitemapSpider):
    name = 'reuters'
    sitemap_urls = ['https://www.reuters.com/sitemap_news_index1.xml']

    def parse(self, response):
        # How can I get data like news:stock_tickers from sitemap for this item? I only have url from the sitemap here.
        yield {
            'title': response.css("title ::text").extract_first(),
            'url': response.url
        }
Sitemap item example
<url>
  <loc>
    https://www.reuters.com/article/micron-tech-results/update-6-micron-sales-profit-miss-estimates-as-chip-glut-hurts-prices-idUSL3N1YN50N
  </loc>
  <news:news>
    <news:publication>
      <news:name>Reuters</news:name>
      <news:language>eng</news:language>
    </news:publication>
    <news:publication_date>2018-12-19T03:50:10+00:00</news:publication_date>
    <news:title>
      UPDATE 6-Micron sales, profit miss estimates as chip glut hurts prices
    </news:title>
    <news:keywords>Headlines,Industrial Conglomerates</news:keywords>
    <news:stock_tickers>
      SEO:000660,SEO:005930,TYO:6502,NASDAQ:AAPL,NASDAQ:AMZN
    </news:stock_tickers>
  </news:news>
</url>
2 Answers
SitemapSpider is specialized for extracting links and nothing else, so it doesn’t provide a means of extracting additional data from a sitemap.
You could override its _parse_sitemap method to pass the data in the generated requests’ meta.
However, if your sitemap is simple enough, it might be simpler to just do your own sitemap parsing.
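To illustrate the "do your own sitemap parsing" option, here is a minimal sketch that uses only the Python standard library. It assumes the sitemap declares the standard sitemaps.org namespace and the Google News extension namespace; the function name parse_news_sitemap and the helper names are hypothetical, and the real Reuters file may need adjustments.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the standard sitemap schema and the Google News extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def _text(node, path):
    """Return the stripped text found at `path` below `node`, or '' if absent."""
    return (node.findtext(path, default="", namespaces=NS) or "").strip()


def _split(value):
    """Split a comma-separated string into a list, dropping empty parts."""
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_news_sitemap(xml_bytes):
    """Yield one dict per <url> entry of a news sitemap (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    for url in root.findall("sm:url", NS):
        yield {
            "url": _text(url, "sm:loc"),
            "title": _text(url, "news:news/news:title"),
            "keywords": _split(_text(url, "news:news/news:keywords")),
            "stock_tickers": _split(_text(url, "news:news/news:stock_tickers")),
        }
```

Each yielded dict carries the article URL together with its sitemap metadata, so the values can be attached to whatever request you then make for that URL.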
As @stranac pointed out, Scrapy and its built-in spiders are designed to extract information from web pages. Sitemaps are a good way to discover links to the pages of a website, but SitemapSpider isn’t really suited to extracting information directly from the sitemap itself.
So as suggested, you need to create your own spider, which should be like this:
Please let me know what you think and whether I can help you with anything else.