Update spider settings during runtime
Summary
It should be possible to update some of a job's settings while it is running. This would be especially useful for the settings related to crawling speed.
Motivation
I have experienced several cases where the crawl speed was too slow because CONCURRENT_REQUESTS
was set too low initially, or for other external reasons such as changes in website availability or in proxy response times. When this happens after more than 200 hours of runtime, restarting the job in order to change the settings is not always an option.
For this reason, it should be possible to update settings dynamically while the spider is still running.
Describe alternatives you’ve considered
The best approach I have found so far is to move the functionality you need to update into your own custom middlewares or extensions. There, the initial configuration is based on the spider settings and, during execution, it can be updated to new values based on:
- The spider status itself
- The responses from the website
- External updates from the Telnet shell.
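To make this workaround concrete, here is a minimal sketch of the adaptive logic such a custom middleware or extension could own. The class name, thresholds, and back-off policy are all invented for illustration; in a real project this object would be created in a component's `from_crawler()` method, seeded from the spider settings, and fed from `process_response()` or a signal handler.

```python
class AdaptiveConcurrency:
    """Hypothetical helper: tracks response health and adjusts a
    concurrency target that the owning middleware applies at runtime."""

    def __init__(self, initial, floor=1, ceiling=64):
        # `initial` would come from the spider settings at startup,
        # e.g. crawler.settings.getint("CONCURRENT_REQUESTS").
        self.concurrency = initial
        self.floor = floor
        self.ceiling = ceiling

    def record(self, status):
        # Back off sharply on server errors, ramp up slowly on success.
        if status >= 500:
            self.concurrency = max(self.floor, self.concurrency // 2)
        elif status < 400:
            self.concurrency = min(self.ceiling, self.concurrency + 1)
```

The same object could also be mutated from the Telnet shell, since the shell has access to the live crawler components.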
A better approach would be to let Scrapy core itself handle this update process in some way, without having to duplicate this functionality in custom middlewares tailored to every setting you need to change.
One way to do that would be to have signals indicating that a certain setting must be updated. I see 2 different approaches here:
- Update any spider setting and then trigger a "reload all settings" signal that would make the crawler engine reload every single setting where required.
- Trigger an "update {setting_name} value" signal that would make the crawler engine reload only that setting in the part of the code where it's needed.
In both cases, it should be possible to trigger these signals both from the spider code itself (parse methods, middlewares, extensions) and from the telnet interface or any other interface that allows access to the job's internal objects.
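As a toy sketch of the second approach, the flow could look like the following. The `SignalBus` class below is a stand-in for Scrapy's real signal machinery (which is built on pydispatcher), and the `update_setting:*` signal name and handler are invented here to illustrate the idea, not an existing API.

```python
class SignalBus:
    """Minimal stand-in for a signal dispatcher (hypothetical)."""

    def __init__(self):
        self._handlers = {}

    def connect(self, signal, handler):
        self._handlers.setdefault(signal, []).append(handler)

    def send(self, signal, **kwargs):
        for handler in self._handlers.get(signal, []):
            handler(**kwargs)


bus = SignalBus()
live_settings = {"CONCURRENT_REQUESTS": 16}

def reload_concurrency(name, value):
    # The component that consumes the setting re-reads it here,
    # at the point in the code where it is actually used.
    live_settings[name] = value

# A component (e.g. the downloader) registers interest in one setting:
bus.connect("update_setting:CONCURRENT_REQUESTS", reload_concurrency)

# Later, spider code or the telnet console triggers the update:
bus.send("update_setting:CONCURRENT_REQUESTS",
         name="CONCURRENT_REQUESTS", value=64)
```

The "reload all settings" variant would differ only in that every connected component re-reads its settings on a single broadcast signal.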
Additional context
- For some settings, updating at runtime may not be possible or may not make sense.
- When several spiders are running under a single crawler engine, updating the settings of one spider may have side effects on the others. The implementation needs to be aware of this to avoid introducing bugs.
Issue Analytics
- Created 4 years ago
- Reactions: 5
- Comments: 5 (4 by maintainers)
Top GitHub Comments
@VMRuiz , @Gallaecio , @wRAR
1. Scrapy reads `settings` attributes only one time, before the spider's `start_requests` method is called.
2. If you update settings after `start_requests` (or after the `spider_opened` signal), it will not change the scraping process, because of p.1.
From my point of view, on current versions of Scrapy there is only one reliable way to change process parameters originated from `settings` values during runtime: directly change submodule/middleware/extension variables (or entire objects) from `Spider` methods or from custom middleware methods.
Example: let's assume that we need to change process parameters originated from downloader-related settings during runtime: `CONCURRENT_REQUESTS`, `CONCURRENT_REQUESTS_PER_DOMAIN`, `CONCURRENT_REQUESTS_PER_IP`, `RANDOMIZE_DOWNLOAD_DELAY`, `DOWNLOAD_DELAY`.
Step 1. Find the piece of Scrapy code where it reads those settings. From this query we can find the related code of the `Downloader` class: https://github.com/scrapy/scrapy/blob/5577d4d2be34bc68ace26c747ca636baa1563e5e/scrapy/core/downloader/__init__.py#L79-L82 and: https://github.com/scrapy/scrapy/blob/5577d4d2be34bc68ace26c747ca636baa1563e5e/scrapy/core/downloader/__init__.py#L58-L59
Step 2. Using an IDE (in this example, PyCharm), set a breakpoint inside the spider's `parse` method and find the variables from step 1.
Step 3. From the `parse` method, we can access the downloader variables originated from those settings using the bindings from step 2 and make changes.
Pluses: with this approach we can affect every submodule/middleware/extension and change anything (including variables affected by `settings` values) during runtime, and it is available on current versions of Scrapy.
Minuses: developers can find this approach very complicated, as it requires knowledge and understanding of the Scrapy source code. Also, configuring an IDE can be a serious challenge for beginners, as this approach requires the ability to debug and place breakpoints on literally every line of code inside the application. From my point of view, the Scrapy shell and Scrapy logs can't provide enough information for this.
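A minimal sketch of step 3, written as a standalone helper rather than inline spider code. The attribute names (`total_concurrency`, `domain_concurrency`, `randomize_delay`) are taken from the `Downloader` class at the linked revision of `scrapy/core/downloader/__init__.py`; since they are internal implementation details rather than public API, verify them against your installed Scrapy version before relying on this.

```python
def retune_downloader(downloader, concurrency=None,
                      domain_concurrency=None, randomize_delay=None):
    """Apply new values to a live Downloader instance.

    Only the parameters that are passed are changed; the rest keep
    their current values. Note: new values affect future requests,
    not requests already scheduled into downloader slots.
    """
    if concurrency is not None:
        downloader.total_concurrency = concurrency
    if domain_concurrency is not None:
        downloader.domain_concurrency = domain_concurrency
    if randomize_delay is not None:
        downloader.randomize_delay = randomize_delay

# Inside a spider's parse method it would be called as, e.g.:
#   retune_downloader(self.crawler.engine.downloader, concurrency=64)
# The same call works from the telnet console, where the engine is
# exposed as `engine`.
```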
The above informative answer should make it into the docs, to augment the telnet console example.