Extensions, spider and settings initialization order
Current initialization order of spider, settings and all sorts of middlewares and extensions:
Crawler.__init__ (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L32):
- self.spidercls.update_settings(self.settings)
- All extensions __init__
- self.settings.freeze()

Crawler.crawl (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L70):
- Spider.__init__
- All downloader middlewares __init__
- All spider middlewares __init__
- All pipelines __init__
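The consequence of this order can be shown with a small toy model. This is only a sketch: ToySettings, ToyCloseSpider, ToySpider and ToyCrawler are hypothetical stand-ins written for illustration, not Scrapy classes, and they only mimic the ordering listed above.

```python
class ToySettings:
    """Hypothetical stand-in for a settings object that can be frozen."""

    def __init__(self, values=None):
        self._values = dict(values or {})
        self._frozen = False

    def get(self, name, default=None):
        return self._values.get(name, default)

    def set(self, name, value):
        if self._frozen:
            raise TypeError(f"cannot set {name}: settings are frozen")
        self._values[name] = value

    def freeze(self):
        self._frozen = True


class ToyCloseSpider:
    """Hypothetical extension that reads its timeout once, in __init__."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ToySpider:
    custom_settings = {}

    def __init__(self, timeout=None):
        self.timeout = timeout

    @classmethod
    def update_settings(cls, settings):
        # Only class-level data is available here: no instance exists yet.
        for name, value in cls.custom_settings.items():
            settings.set(name, value)


class ToyCrawler:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings
        self.spidercls.update_settings(self.settings)   # 1. class-level settings
        self.extension = ToyCloseSpider(self.settings)  # 2. extensions __init__
        self.settings.freeze()                          # 3. freeze
        self.spider = None

    def crawl(self, **spider_kwargs):
        self.spider = self.spidercls(**spider_kwargs)   # 4. spider __init__
        return self.spider


crawler = ToyCrawler(ToySpider, ToySettings({"CLOSESPIDER_TIMEOUT": 0}))
spider = crawler.crawl(timeout=60)
try:
    crawler.settings.set("CLOSESPIDER_TIMEOUT", spider.timeout)
    changed = True
except TypeError:
    changed = False  # too late: settings were frozen before the spider existed
```

In this model the spider argument arrives only after the extension has read its setting and the settings have been frozen, so it cannot influence either.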
It’s not clear why extensions are initialized during Crawler.__init__ and not in Crawler.crawl. Is it some legacy code left untouched from the times when it was possible to run several spiders through the same set of extensions and middlewares?
I’m asking because sometimes I want to change some crawl settings after spider initialization and initialize middlewares only after that. For example, a customer asked me to make it possible to set CLOSESPIDER_TIMEOUT based on a passed spider argument. Because of how CloseSpider is implemented, to support this I need to override it, disable the default extension and enable the custom one in settings. If the initialization order were
'spider init' -> 'update settings' -> 'settings.freeze' -> 'middlewares init'
that task would be as easy as setting CLOSESPIDER_TIMEOUT in custom_settings.
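For reference, the settings side of the workaround described above might look roughly like this. It is a sketch under assumptions: myproject.extensions.ArgumentCloseSpider is a hypothetical name for the custom extension (its implementation is not shown), and the order value 0 is only an example.

```python
# settings.py (fragment)
EXTENSIONS = {
    # Setting a component to None disables it; this is the built-in extension.
    "scrapy.extensions.closespider.CloseSpider": None,
    # Hypothetical replacement that would take its timeout from a spider argument.
    "myproject.extensions.ArgumentCloseSpider": 0,
}
```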
I’m not talking about command-line usage: -s does work on the command line, but spiders are often started some other way (Scrapy Cloud, ScrapyRT), and in those cases it’s not always possible to set per-crawl settings. A spider may also have logic that decides whether some setting should be set based on spider arguments; -s doesn’t work well for that case either.
Based on the above arguments I would like to propose a different initialization order:
Crawler.__init__:
- self.settings = settings.copy()

Crawler.crawl:
- Spider.__init__
- spider.update_settings(self.settings) (notice that in this case update_settings is not required to be a @classmethod)
- self.settings.freeze()
- All extensions __init__
- All downloader middlewares __init__
- All spider middlewares __init__
- All pipelines __init__
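The proposed order can be sketched with a toy model as well. As before, this is only an illustration: ToySettings, ToyCloseSpider, ToySpider and ToyCrawler are hypothetical names, not Scrapy classes, and the code only mirrors the steps listed above.

```python
class ToySettings:
    """Hypothetical stand-in for a settings object that can be frozen."""

    def __init__(self, values=None):
        self._values = dict(values or {})
        self._frozen = False

    def copy(self):
        return ToySettings(self._values)

    def get(self, name, default=None):
        return self._values.get(name, default)

    def set(self, name, value):
        if self._frozen:
            raise TypeError(f"cannot set {name}: settings are frozen")
        self._values[name] = value

    def freeze(self):
        self._frozen = True


class ToyCloseSpider:
    """Hypothetical extension that reads its timeout once, in __init__."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ToySpider:
    def __init__(self, timeout=0):
        self.timeout = int(timeout)

    def update_settings(self, settings):
        # An instance method now: spider arguments are available here.
        settings.set("CLOSESPIDER_TIMEOUT", self.timeout)


class ToyCrawler:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings.copy()

    def crawl(self, **spider_kwargs):
        self.spider = self.spidercls(**spider_kwargs)   # 1. spider __init__
        self.spider.update_settings(self.settings)      # 2. update settings
        self.settings.freeze()                          # 3. freeze
        self.extension = ToyCloseSpider(self.settings)  # 4. extensions __init__
        return self.spider


base = ToySettings({"CLOSESPIDER_TIMEOUT": 0})
crawler = ToyCrawler(ToySpider, base)
crawler.crawl(timeout="60")
```

Here the extension is created after the spider has written its argument into the settings, so it picks up the per-crawl timeout without being overridden, and the settings passed to the crawler stay untouched because the crawler works on a copy.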
What do you think about this proposal?
Discussion on this issue was originally started in https://github.com/scrapy/scrapy/pull/1276#issuecomment-110673089
Top GitHub Comments
Good point. My justification for this difference is that the spider works like a settings producer and should be able to change them; other components are settings consumers and shouldn’t be able to change settings.
@eLRuLL You’re right, it seems that all extension features can be emulated with middleware if we make the change suggested in this ticket. Hm, but deprecating extensions would mean that if you only need to connect signals you have to put the code in a middleware or a spider; in a spider it can’t be enabled via an option; in a middleware you need to figure out where to put it (downloader middlewares? spider middlewares? what priority to use?).