
I have read the documentation, but I must have overlooked something elementary. It is just a spider that starts at http://quotes.toscrape.com/, uses a single rule, and has one parsing function that logs the link. But it won't crawl any pages, not even the start_urls.

Here is the code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Crawl_All(CrawlSpider):
    name = 'Crawl_All'
    strat_urls = ['http://quotes.toscrape.com/']
    rules = [
        Rule(LinkExtractor(), callback='Parse_for_new_url', follow=True),
            ]

    def Parse_for_new_url(self, response):
        self.logger.log('got a new url:', response.url)

Here is the output:

2020-02-27 13:58:55 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Auto_Contest)
2020-02-27 13:58:55 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (default, Jan  8 2020, 19:59:22) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-5.3.0-40-generic-x86_64-with-debian-buster-sid
2020-02-27 13:58:55 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Auto_Contest', 'NEWSPIDER_MODULE': 'Auto_Contest.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Auto_Contest.spiders']}
2020-02-27 13:58:55 [scrapy.extensions.telnet] INFO: Telnet Password: 928bba99b8a0c238
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Spider opened
2020-02-27 13:58:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-27 13:58:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-27 13:58:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 27, 12, 58, 56, 114277),
 'log_count/INFO': 9,
 'memusage/max': 54910976,
 'memusage/startup': 54910976,
 'start_time': datetime.datetime(2020, 2, 27, 12, 58, 56, 104321)}
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Spider closed (finished)

EDIT: Solved. It turned out to be a simple typo: strat_urls should have been start_urls.
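For reference, the corrected attribute is simply:

start_urls = ['http://quotes.toscrape.com/']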

2 Answers


  1. I think it might be because you are not defining a parse() method on your spider, which may be required. I suggest making your spider fit the form below and then running the crawl on it. If you need to do more specific scraping from there, you can chain other parsing callbacks, but I think you need to have your own parse() defined first and work from there.

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        name = 'testspider'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            print(response.url)

  2. You have a simple typo: strat_urls should be start_urls.


    You also have to pass two values to log():

    • the type of message you are sending (i.e. the log level: warning, debug, etc.),

    • a single string, so you have to concatenate 'got a new url: ' + response.url (see the example below).
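    For example, the generic log() call passes the level explicitly (logging has to be imported for the level constant):

    import logging

    self.logger.log(logging.WARNING, 'got a new url: ' + response.url)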

    You can also use the predefined functions (warning(), debug(), etc.); then you don't need the first argument, but you still have to use a single string:

    self.logger.warning('got a new url:' + response.url)
    

    Here is my code, which can be run without creating a project:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class Crawl_All(CrawlSpider):
        name = 'Crawl_All'
        start_urls = ['http://quotes.toscrape.com/']
        rules = [Rule(LinkExtractor(), callback='Parse_for_new_url', follow=True),]
    
        def Parse_for_new_url(self, response):
            #print(response.url)
            self.logger.warning('got a new url:' + response.url)
    
    from scrapy.crawler import CrawlerProcess
    
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEED_FORMAT': 'csv',     # csv, json, xml
        'FEED_URI': 'output.csv',
    })
    c.crawl(Crawl_All)
    c.start()
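    If you save this as a standalone script, for example spider.py (any name works), you can run it directly with python spider.py; the discovered URLs appear as WARNING lines in the console.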
    