
I have read the documentation, but I must have overlooked something elementary. It is just a spider that starts at http://quotes.toscrape.com/, uses a single rule, and has one parsing function that logs the link. But it won't crawl any pages, not even the start_urls.

Here is the code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Crawl_All(CrawlSpider):
    name = 'Crawl_All'
    strat_urls = ['http://quotes.toscrape.com/']
    rules = [
        Rule(LinkExtractor(), callback='Parse_for_new_url', follow=True),
            ]

    def Parse_for_new_url(self, response):
        self.logger.log('got a new url:', response.url)

Here is the output:

2020-02-27 13:58:55 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Auto_Contest)
2020-02-27 13:58:55 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (default, Jan  8 2020, 19:59:22) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-5.3.0-40-generic-x86_64-with-debian-buster-sid
2020-02-27 13:58:55 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Auto_Contest', 'NEWSPIDER_MODULE': 'Auto_Contest.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Auto_Contest.spiders']}
2020-02-27 13:58:55 [scrapy.extensions.telnet] INFO: Telnet Password: 928bba99b8a0c238
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-27 13:58:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Spider opened
2020-02-27 13:58:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-27 13:58:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-27 13:58:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 27, 12, 58, 56, 114277),
 'log_count/INFO': 9,
 'memusage/max': 54910976,
 'memusage/startup': 54910976,
 'start_time': datetime.datetime(2020, 2, 27, 12, 58, 56, 104321)}
2020-02-27 13:58:56 [scrapy.core.engine] INFO: Spider closed (finished)

EDIT: Solved. It turned out to be a simple typo: strat_urls should have been start_urls.
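For reference, the corrected attribute is simply:

start_urls = ['http://quotes.toscrape.com/']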

2 Answers


  1. I think it might be because you are not defining a parse() method on your spider, which may be required. I suggest making your spider fit the form below and then running the crawl on it. If you need to do more specific scraping from there, you can chain other parsing callbacks, but I think you need to have your own parse() defined first and work from there.

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        name = 'testspider'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            print(response.url)

  2. You have a simple typo: strat_urls should be start_urls.


    You also have to pass two values to log():

    • the type of message you are sending (i.e. the log level: warning, debug, etc.),

    • a single string, so you have to concatenate 'got a new url: ' + response.url (see the example below).
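    For example, the generic log() call passes the level explicitly (logging has to be imported for the level constant):

    import logging

    self.logger.log(logging.WARNING, 'got a new url: ' + response.url)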

    You can also use the predefined functions (warning(), debug(), etc.); then you don't need the first argument, but you still have to use a single string:

    self.logger.warning('got a new url:' + response.url)
    

    Here is my code, which can be run without creating a project:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class Crawl_All(CrawlSpider):
        name = 'Crawl_All'
        start_urls = ['http://quotes.toscrape.com/']
        rules = [Rule(LinkExtractor(), callback='Parse_for_new_url', follow=True),]
    
        def Parse_for_new_url(self, response):
            #print(response.url)
            self.logger.warning('got a new url:' + response.url)
    
    from scrapy.crawler import CrawlerProcess
    
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEED_FORMAT': 'csv',     # csv, json, xml
        'FEED_URI': 'output.csv',
    })
    c.crawl(Crawl_All)
    c.start()
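    If you save this as a standalone script, for example spider.py (any name works), you can run it directly with python spider.py; the discovered URLs appear as WARNING lines in the console.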
    