Update spider settings during runtime
Summary
It should be possible to update some of a job's settings while it is running. This would be especially useful for the settings related to crawling speed.
Motivation
I have experienced several cases where the crawl speed was too slow because CONCURRENT_REQUESTS
was set too low initially, or for other external reasons such as changes in website availability or in proxy response times. When this happens after more than 200 hours of runtime, restarting the job in order to change the settings is not always an option.
For this reason, it should be possible to update settings dynamically while the spider is still running.
Describe alternatives you’ve considered
The best approach I have found so far is to move the functionality you need to update into your own custom middlewares or extensions. There, the initial configuration is based on the spider settings and, during execution, it can be updated to new values based on:
- The spider status itself
- The responses from the website
- External updates from the Telnet shell.
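To make this workaround concrete, here is a minimal sketch of the adaptive logic such a custom middleware or extension could own. The class name, thresholds, and back-off policy are all invented for illustration; in a real project this object would be created in a component's `from_crawler()` method, seeded from the spider settings, and fed from `process_response()` or a signal handler.

```python
class AdaptiveConcurrency:
    """Hypothetical helper: tracks response health and adjusts a
    concurrency target that the owning middleware applies at runtime."""

    def __init__(self, initial, floor=1, ceiling=64):
        # `initial` would come from the spider settings at startup,
        # e.g. crawler.settings.getint("CONCURRENT_REQUESTS").
        self.concurrency = initial
        self.floor = floor
        self.ceiling = ceiling

    def record(self, status):
        # Back off sharply on server errors, ramp up slowly on success.
        if status >= 500:
            self.concurrency = max(self.floor, self.concurrency // 2)
        elif status < 400:
            self.concurrency = min(self.ceiling, self.concurrency + 1)
```

The same object could also be mutated from the Telnet shell, since the shell has access to the live crawler components.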
A better approach would be to let Scrapy core itself handle this update process in some way, without having to duplicate this functionality in custom middlewares tailored to every setting you need to change.
One way to do that would be to have signals indicating that a certain setting must be updated. I see 2 different approaches here:
- Update any spider setting and then trigger a "reload all settings" signal that would make the crawler engine reload every single setting where required.
- Trigger an "update {setting_name} value" signal that would make the crawler engine reload only that setting in the part of the code where it's needed.
In both cases, it should be possible to trigger these signals both from the spider code itself (parse methods, middlewares, extensions) and from the telnet interface or any other interface that allows access to the job's internal objects.
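As a toy sketch of the second approach, the flow could look like the following. The `SignalBus` class below is a stand-in for Scrapy's real signal machinery (which is built on pydispatcher), and the `update_setting:*` signal name and handler are invented here to illustrate the idea, not an existing API.

```python
class SignalBus:
    """Minimal stand-in for a signal dispatcher (hypothetical)."""

    def __init__(self):
        self._handlers = {}

    def connect(self, signal, handler):
        self._handlers.setdefault(signal, []).append(handler)

    def send(self, signal, **kwargs):
        for handler in self._handlers.get(signal, []):
            handler(**kwargs)


bus = SignalBus()
live_settings = {"CONCURRENT_REQUESTS": 16}

def reload_concurrency(name, value):
    # The component that consumes the setting re-reads it here,
    # at the point in the code where it is actually used.
    live_settings[name] = value

# A component (e.g. the downloader) registers interest in one setting:
bus.connect("update_setting:CONCURRENT_REQUESTS", reload_concurrency)

# Later, spider code or the telnet console triggers the update:
bus.send("update_setting:CONCURRENT_REQUESTS",
         name="CONCURRENT_REQUESTS", value=64)
```

The "reload all settings" variant would differ only in that every connected component re-reads its settings on a single broadcast signal.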
Additional context
- For some settings, updating at runtime may not be possible or may not make sense.
- When several spiders are running under a single crawler engine, updating the settings of one spider may have side effects on the others. The implementation needs to be aware of this to avoid introducing bugs.
Issue Analytics
- Created 4 years ago
- Reactions: 5
- Comments: 5 (4 by maintainers)
Top GitHub Comments
@VMRuiz , @Gallaecio , @wRAR
1. Scrapy reads `settings` attributes only one time, before the spider's `start_requests` method is called.
2. If you update settings after `start_requests` (or after the `spider_opened` signal), it will not change the scraping process, because of p.1.
From my point of view, on current versions of Scrapy there is only one reliable way to change process parameters originated from `settings` values during runtime: directly change submodule/middleware/extension variables (or entire objects) from `Spider` methods or from custom middleware methods.
Example: let's assume that we need to change process parameters originated from downloader-related settings during runtime: `CONCURRENT_REQUESTS`, `CONCURRENT_REQUESTS_PER_DOMAIN`, `CONCURRENT_REQUESTS_PER_IP`, `RANDOMIZE_DOWNLOAD_DELAY`, `DOWNLOAD_DELAY`.
Step 1. Find the piece of Scrapy code where it reads those settings. From this query we can find the related code of the `Downloader` class: https://github.com/scrapy/scrapy/blob/5577d4d2be34bc68ace26c747ca636baa1563e5e/scrapy/core/downloader/__init__.py#L79-L82 and: https://github.com/scrapy/scrapy/blob/5577d4d2be34bc68ace26c747ca636baa1563e5e/scrapy/core/downloader/__init__.py#L58-L59
Step 2. Using an IDE (in this example, PyCharm), set a breakpoint inside the spider's `parse` method and find the variables from step 1.
Step 3. From the `parse` method, we can access the downloader variables originated from those settings using the bindings from step 2 and make changes.
Pluses: with this approach we can affect every submodule/middleware/extension and change anything (including variables affected by `settings` values) during runtime, and it is available on current versions of Scrapy.
Minuses: developers can find this approach very complicated, as it requires knowledge and understanding of the Scrapy source code. Also, configuring an IDE can be a serious challenge for beginners, as this approach requires the ability to debug and place breakpoints on literally every line of code inside the application. From my point of view, the Scrapy shell and Scrapy logs can't provide enough information for this.
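A minimal sketch of step 3, written as a standalone helper rather than inline spider code. The attribute names (`total_concurrency`, `domain_concurrency`, `randomize_delay`) are taken from the `Downloader` class at the linked revision of `scrapy/core/downloader/__init__.py`; since they are internal implementation details rather than public API, verify them against your installed Scrapy version before relying on this.

```python
def retune_downloader(downloader, concurrency=None,
                      domain_concurrency=None, randomize_delay=None):
    """Apply new values to a live Downloader instance.

    Only the parameters that are passed are changed; the rest keep
    their current values. Note: new values affect future requests,
    not requests already scheduled into downloader slots.
    """
    if concurrency is not None:
        downloader.total_concurrency = concurrency
    if domain_concurrency is not None:
        downloader.domain_concurrency = domain_concurrency
    if randomize_delay is not None:
        downloader.randomize_delay = randomize_delay

# Inside a spider's parse method it would be called as, e.g.:
#   retune_downloader(self.crawler.engine.downloader, concurrency=64)
# The same call works from the telnet console, where the engine is
# exposed as `engine`.
```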
The above informative answer should make it into the docs, to augment the telnet console example.