I have read through the documentation, but I must have overlooked something elementary. It is just a spider that starts at http://quotes.toscrape.com/, then uses only one rule and a parsing function that logs the link. But it won't crawl any pages, not even the start_urls.
Here is the code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Crawl_All(CrawlSpider):
    name = 'Crawl_All'
    strat_urls = ['http://quotes.toscrape.com/']

    rules = [
        Rule(LinkExtractor(), callback='Parse_for_new_url', follow=True),
    ]

    def Parse_for_new_url(self, response):
        self.logger.log('got a new url:', response.url)
Here is the output:
2020-02-27 13:58:55 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Auto_Contest)
2020-02-27 13:58:55 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (default, Jan 8 2020, 19:59:22) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Linux-5.3.0-40-generic-x86_64-with-debian-buster-sid
2020-02-27 13:58:55 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Auto_Contest', 'NEWSPIDER_MODULE': 'Auto_Contest.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Auto_Contest.spiders']}
2020-02-27 13:58:55 [scrapy.extensions.telnet] INFO: Telnet Password: 928bba99b8a0c238
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Spider opened
2020-02-27 13:58:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-27 13:58:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-27 13:58:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 2, 27, 12, 58, 56, 114277),
'log_count/INFO': 9,
'memusage/max': 54910976,
'memusage/startup': 54910976,
'start_time': datetime.datetime(2020, 2, 27, 12, 58, 56, 104321)}
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Spider closed (finished)
EDIT: Solved. It was just a simple typo: strat_urls should have been start_urls.
2 Answers
I think it might be because you are not using the parse() method with your spider, which may be required. I suggest trying to make your spider fit the form below and then running crawl on it. If you need to do more specific scraping from there, you can chain other parsing callbacks, but I think you need to have your own parse() defined first and work from there.
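A minimal sketch of such a form, assuming a plain scrapy.Spider whose default parse() callback logs and follows every link (the class name and selector here are illustrative, not from the original post):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # parse() is the default callback for responses to start_urls
        for href in response.css('a::attr(href)').extract():
            # log the discovered link, then schedule it for crawling
            self.logger.info('got a new url: %s', response.urljoin(href))
            yield response.follow(href, callback=self.parse)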
You have a simple typo in strat_urls, which should be start_urls.
You also have to give log() two values: the type of message you send (i.e. a level such as warning, debug, etc.), and a single string, so you have to concatenate 'got a new url: ' + response.url. You can also use the predefined methods (info(), debug(), etc.); then you don't need the first argument, but you still have to pass a single string.
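For example, a sketch of the corrected callback (using a level constant from the standard logging module; any level works):

import logging

# inside the spider class:
def Parse_for_new_url(self, response):
    # log() takes a level plus a single message string
    self.logger.log(logging.INFO, 'got a new url: ' + response.url)
    # or use a predefined method, which supplies the level itself
    self.logger.info('got a new url: ' + response.url)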
My code, which can be run without creating a project:
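(A minimal sketch of such a standalone script, assuming scrapy.crawler.CrawlerProcess is used to run the spider; it fixes both the start_urls typo and the logger call.)

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Crawl_All(CrawlSpider):
    name = 'Crawl_All'
    start_urls = ['http://quotes.toscrape.com/']  # fixed: was strat_urls

    rules = [
        Rule(LinkExtractor(), callback='Parse_for_new_url', follow=True),
    ]

    def Parse_for_new_url(self, response):
        # predefined method plus a single concatenated string
        self.logger.info('got a new url: ' + response.url)

# CrawlerProcess runs the spider as a plain script, no project required
process = CrawlerProcess()
process.crawl(Crawl_All)
process.start()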