Extensions, spider and settings initialization order

Current initialization order of spider, settings and all sorts of middlewares and extensions:

Crawler.__init__ (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L32):

  1. self.spidercls.update_settings(self.settings)
  2. All extensions __init__
  3. self.settings.freeze()

Crawler.crawl (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L70):

  1. Spider.__init__
  2. All downloader middlewares __init__
  3. All spider middlewares __init__
  4. All pipelines __init__
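
The order above can be mimicked with a small standalone model. This is only a sketch: ToySettings, ToyExtension, ToySpider and ToyCrawler are hypothetical stand-ins written for illustration, not Scrapy classes, and they copy just the ordering listed above. It shows that the extension has already read its timeout, and the settings are already frozen, by the time spider arguments become available.

```python
class ToySettings:
    """Hypothetical stand-in for a settings object that can be frozen."""

    def __init__(self, values=None):
        self._values = dict(values or {})
        self._frozen = False

    def set(self, name, value):
        if self._frozen:
            raise TypeError("Trying to modify frozen settings")
        self._values[name] = value

    def get(self, name, default=None):
        return self._values.get(name, default)

    def freeze(self):
        self._frozen = True


class ToyExtension:
    """Reads its configuration once, at construction time."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ToySpider:
    custom_settings = {"CLOSESPIDER_TIMEOUT": 60}

    def __init__(self, timeout=None):
        self.timeout = timeout  # per-crawl spider argument

    @classmethod
    def update_settings(cls, settings):
        for name, value in cls.custom_settings.items():
            settings.set(name, value)


class ToyCrawler:
    """Models only the current ordering described in the issue."""

    def __init__(self, spidercls):
        self.spidercls = spidercls
        self.settings = ToySettings()
        self.events = []
        spidercls.update_settings(self.settings)
        self.events.append("update_settings")
        self.extension = ToyExtension(self.settings)
        self.events.append("extension init")
        self.settings.freeze()
        self.events.append("freeze")

    def crawl(self, **spider_kwargs):
        self.spider = self.spidercls(**spider_kwargs)
        self.events.append("spider init")
        return self.spider


crawler = ToyCrawler(ToySpider)
spider = crawler.crawl(timeout=5)

# The spider argument arrives too late: settings are frozen and the
# extension has already captured the class-level value.
try:
    crawler.settings.set("CLOSESPIDER_TIMEOUT", spider.timeout)
    late_change = "accepted"
except TypeError:
    late_change = "rejected"
```

In this model the only values the extension can ever see are those known at class level, which is the limitation the rest of the issue is about.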

It’s not clear why extensions are initialized during Crawler.__init__ and not in Crawler.crawl. Is it legacy code left over from the time when it was possible to run several spiders through the same set of extensions and middlewares?

I’m asking because I sometimes want to change crawl settings after spider initialization and initialize middlewares only after that. For example, a customer asked me to make it possible to set CLOSESPIDER_TIMEOUT based on a spider argument. Because of how the CloseSpider extension is implemented, supporting this means overriding the extension, disabling the default one and enabling the custom one in settings. If the initialization order was

'spider init' -> 'update settings' -> 'settings.freeze' -> 'middlewares init' 

that task would be as easy as setting CLOSESPIDER_TIMEOUT in custom_settings.

I’m not talking about command-line usage: -s does work on the command line. But spiders are often started some other way, for example in Scrapy Cloud or ScrapyRT, where it’s not always possible to set per-crawl settings. A spider may also have its own logic for deciding, based on spider arguments, whether some setting should be set; -s doesn’t work well in that case either.

Based on the arguments above, I would like to propose a different initialization order:

Crawler.__init__:

  1. self.settings = settings.copy()

Crawler.crawl:

  1. Spider.__init__
  2. spider.update_settings(self.settings) - note that in this case update_settings no longer needs to be a @classmethod
  3. self.settings.freeze()
  4. All extensions __init__
  5. All downloader middlewares __init__
  6. All spider middlewares __init__
  7. All pipelines __init__
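
A rough standalone model of the proposed order, for comparison. Again, every name here (ToySettings, ToyExtension, ArgAwareSpider, ProposedCrawler) is a hypothetical stand-in and not Scrapy code; middlewares and pipelines are left out for brevity. The point is that update_settings becomes an instance method, so it can look at spider arguments before the settings are frozen and before the extension reads them.

```python
class ToySettings:
    """Hypothetical stand-in for a settings object that can be frozen."""

    def __init__(self, values=None):
        self._values = dict(values or {})
        self._frozen = False

    def set(self, name, value):
        if self._frozen:
            raise TypeError("Trying to modify frozen settings")
        self._values[name] = value

    def get(self, name, default=None):
        return self._values.get(name, default)

    def copy(self):
        # The copy is always mutable, whatever the state of the original.
        return ToySettings(self._values)

    def freeze(self):
        self._frozen = True


class ToyExtension:
    """Reads its configuration once, at construction time."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ArgAwareSpider:
    custom_settings = {"CLOSESPIDER_TIMEOUT": 60}

    def __init__(self, timeout=None):
        self.timeout = timeout  # per-crawl spider argument

    def update_settings(self, settings):
        # Instance method: spider arguments are visible here.
        for name, value in self.custom_settings.items():
            settings.set(name, value)
        if self.timeout is not None:
            settings.set("CLOSESPIDER_TIMEOUT", self.timeout)


class ProposedCrawler:
    """Models only the proposed ordering from the issue."""

    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings.copy()

    def crawl(self, **spider_kwargs):
        self.spider = self.spidercls(**spider_kwargs)  # 1. spider init
        self.spider.update_settings(self.settings)     # 2. update settings
        self.settings.freeze()                         # 3. freeze
        self.extension = ToyExtension(self.settings)   # 4. extensions init
        return self.spider


base = ToySettings({"CLOSESPIDER_TIMEOUT": 0})

with_arg = ProposedCrawler(ArgAwareSpider, base)
with_arg.crawl(timeout=5)

without_arg = ProposedCrawler(ArgAwareSpider, base)
without_arg.crawl()
```

With this ordering the extension in the first crawler sees the per-crawl value, while the second falls back to custom_settings; the shared base settings are untouched because each crawler works on its own copy.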

What do you think about this proposal?

Discussion on this issue was originally started in https://github.com/scrapy/scrapy/pull/1276#issuecomment-110673089

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Reactions: 1
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

chekunkov commented, Oct 30, 2015 (1 reaction)

spiders won’t be able to read and use final settings in their init method, and their from_crawler method will become different from from_crawler methods of all other components

Good point. My justification for this difference is that the spider works as a settings producer and should be able to change them; other components are settings consumers and shouldn’t be able to change settings.
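
One way to picture the producer/consumer split is with a read-only view. This is a minimal sketch using only the standard library; build_settings and ConsumerExtension are hypothetical names for illustration and have nothing to do with how Scrapy implements frozen settings.

```python
from types import MappingProxyType


def build_settings(spider_args):
    """Producer side: decide the final values, then hand out a read-only view."""
    settings = {"CLOSESPIDER_TIMEOUT": 0}
    if "timeout" in spider_args:
        settings["CLOSESPIDER_TIMEOUT"] = int(spider_args["timeout"])
    return MappingProxyType(settings)


class ConsumerExtension:
    """Consumer side: may read settings but has no way to change them."""

    def __init__(self, settings):
        self.timeout = settings["CLOSESPIDER_TIMEOUT"]


final = build_settings({"timeout": "30"})
ext = ConsumerExtension(final)

# A consumer that tries to write through the view is refused.
try:
    final["CLOSESPIDER_TIMEOUT"] = 1
    consumer_write = "allowed"
except TypeError:
    consumer_write = "blocked"
```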

kmike commented, Mar 4, 2016 (0 reactions)

@eLRuLL You’re right, it seems that all extension features can be emulated with middlewares if we make the change suggested in this ticket. Hm, but deprecating extensions would mean that if you only need to connect signals, you have to put that code in a middleware or a spider. In a spider it can’t be enabled via an option; in a middleware you need to figure out where to put it (downloader middlewares? spider middlewares? what priority to use?).
