
Update spider settings during runtime


Summary

It should be possible to update some of a job's settings while it is running. This would be especially useful for the settings related to crawling speed.

Motivation

I have experienced several cases where the crawl speed was too slow because CONCURRENT_REQUESTS was initially set too low, or because of external factors such as changes in the website's availability or in proxy response times. When this happens after more than 200 hours of runtime, restarting the job just to change the settings is not always an option.

For this reason, it should be possible to update settings dynamically while the spider is still running.

Describe alternatives you’ve considered

The approach I have found so far is to move the functionality you need to update into your own custom middlewares or extensions. There, the initial configuration is based on the spider settings, and during execution it can be updated to new values based on:

  1. The spider status itself
  2. The responses from the website
  3. External updates from the Telnet shell.
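As a concrete sketch of this workaround, the middleware below slows down all download slots whenever the site starts answering with throttling status codes. The class name, status list, and tuning parameters are all hypothetical; the only real assumption is that the default Downloader keeps per-site slots with a mutable `delay` attribute (which the maintainer answer below also relies on). The simulated wiring at the bottom stands in for a real Crawler, just to show the effect:

```python
from types import SimpleNamespace

class AdaptiveDelayMiddleware:
    """Hypothetical downloader middleware: back off when the site throttles us."""

    def __init__(self, crawler, slowdown_statuses=(429, 503),
                 factor=2.0, max_delay=60.0):
        self.crawler = crawler
        self.slowdown_statuses = set(slowdown_statuses)
        self.factor = factor
        self.max_delay = max_delay

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if response.status in self.slowdown_statuses:
            downloader = self.crawler.engine.downloader
            for slot in downloader.slots.values():
                # Multiply the per-slot delay (starting from 1s if it was 0),
                # capped at max_delay.
                slot.delay = min(slot.delay * self.factor or 1.0, self.max_delay)
        return response

# Simulated wiring (stands in for a real Crawler during a dry run):
slot = SimpleNamespace(delay=2.0)
crawler = SimpleNamespace(
    engine=SimpleNamespace(downloader=SimpleNamespace(slots={"example.com": slot}))
)
mw = AdaptiveDelayMiddleware.from_crawler(crawler)
mw.process_response(object(), SimpleNamespace(status=429), spider=None)
print(slot.delay)  # → 4.0 (delay doubled after a throttling response)
```

In a real project this class would be registered under DOWNLOADER_MIDDLEWARES; the point is simply that the updating logic has to live in a component you write yourself.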

A better approach would be for Scrapy core itself to handle this update process in some way, without having to duplicate this functionality in custom middlewares tailored to every setting you need to change.

One way to do that would be to have signals that indicate a certain setting must be updated. I see two different approaches here:

  1. Update any spider setting and then trigger a reload all settings signal that makes the crawler engine reload every setting where required.

  2. Trigger an update {setting_name} value signal that makes the crawler engine reload only that setting in the part of the code where it's needed.

In both cases, it should be possible to trigger these signals both from the spider code itself (parse methods, middlewares, extensions) and from the telnet interface or any other interface that allows access to the job's internal objects.
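To make the proposal concrete, here is a minimal observer-pattern sketch of approach 2 in plain Python. The registration API and the per-setting signal are invented for illustration; they are not part of scrapy.signals:

```python
class SettingSignals:
    """Toy signal bus: components subscribe to the settings they care about."""

    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, setting_name):
        self._receivers.setdefault(setting_name, []).append(receiver)

    def send(self, setting_name, new_value):
        # Approach 2: only components that registered an interest in this
        # specific setting are asked to reload it.
        for receiver in self._receivers.get(setting_name, []):
            receiver(setting_name, new_value)

class FakeDownloader:
    """Stand-in for a component that re-reads CONCURRENT_REQUESTS on demand."""

    def __init__(self, signals):
        self.total_concurrency = 16
        signals.connect(self.reload_setting, "CONCURRENT_REQUESTS")

    def reload_setting(self, name, value):
        self.total_concurrency = value

signals = SettingSignals()
downloader = FakeDownloader(signals)
signals.send("CONCURRENT_REQUESTS", 32)  # e.g. triggered from telnet
print(downloader.total_concurrency)      # → 32
```

A real implementation would hang this off crawler.signals so that any code with access to the crawler (spider callbacks, middlewares, the telnet console) could fire the update.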

Additional context

  1. For some settings, a live update may not be possible or may not make sense.
  2. When several spiders are running under a single crawler engine, updating the settings in one spider may have side effects on the others. The implementation needs to take this into account to avoid introducing bugs.

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 5
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

GeorgeA92 commented, Dec 21, 2019 (7 reactions)

@VMRuiz , @Gallaecio , @wRAR

  1. Nearly all Scrapy submodules/middlewares/extensions (with few exceptions) read settings attributes only once, before the spider's start_requests method is called.
  2. Even if you somehow change a setting attribute after start_requests (or after the spider_opened signal), it will not change the scraping process, because of point 1.
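The read-once behaviour described in point 1 can be illustrated with a few lines of plain Python (a minimal stand-in, not actual Scrapy code):

```python
class Settings(dict):
    """Minimal stand-in for scrapy.settings.Settings."""
    def getint(self, name):
        return int(self[name])

class Downloader:
    def __init__(self, settings):
        # The value is copied exactly once, at construction time,
        # and the settings object is never consulted again.
        self.total_concurrency = settings.getint("CONCURRENT_REQUESTS")

settings = Settings(CONCURRENT_REQUESTS=16)
downloader = Downloader(settings)
settings["CONCURRENT_REQUESTS"] = 32  # too late: the component holds a stale copy
print(downloader.total_concurrency)   # → 16
```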

From my point of view, on current versions of Scrapy there is only one reliable way to change, during runtime, process parameters that originate from settings values: directly change the submodule/middleware/extension variables (or entire objects) from spider methods or from custom middleware methods.

Example: let's assume we need to change, during runtime, process parameters that originate from the downloader-related settings: CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP, RANDOMIZE_DOWNLOAD_DELAY, DOWNLOAD_DELAY.

Step 1. Find the piece of Scrapy code where those settings are read.

From a code search for these setting names we can find the related code of the Downloader class: https://github.com/scrapy/scrapy/blob/5577d4d2be34bc68ace26c747ca636baa1563e5e/scrapy/core/downloader/__init__.py#L79-L82 and: https://github.com/scrapy/scrapy/blob/5577d4d2be34bc68ace26c747ca636baa1563e5e/scrapy/core/downloader/__init__.py#L58-L59

Step 2. Using an IDE (in this example, PyCharm), set a breakpoint inside the spider's parse method and find the variables from step 1.

Step 3. From the parse method we can access the downloader variables that originate from settings, using the bindings from step 2, and make changes:

...
def parse(self, response):
    ...
    if some_condition:
        downloader = self.crawler.engine.downloader
        # CONCURRENT_REQUESTS:
        downloader.total_concurrency = new_total_concurrency
        # CONCURRENT_REQUESTS_PER_DOMAIN
        # (to change already existing download slots, update downloader.slots below):
        downloader.domain_concurrency = new_domain_concurrency
        # CONCURRENT_REQUESTS_PER_IP:
        downloader.ip_concurrency = new_ip_concurrency
        # RANDOMIZE_DOWNLOAD_DELAY:
        downloader.randomize_delay = new_randomize_delay
        # Existing download slots are affected by CONCURRENT_REQUESTS_PER_DOMAIN,
        # DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY:
        for slot in downloader.slots.values():
            slot.delay = new_delay
            slot.concurrency = new_concurrency
            slot.randomize_delay = new_randomize_delay
    ...

Pluses: with this approach we can affect every submodule/middleware/extension and change anything (including variables affected by settings values) during runtime, and it is available on current versions of Scrapy. Minuses: developers can find this approach very complicated, as it requires knowledge and understanding of the Scrapy source code.

Also, configuring an IDE can be a serious challenge for beginners, as this approach requires the ability to debug and place breakpoints on literally every code line inside the application. From my point of view, the Scrapy shell and Scrapy logs can't provide enough information for this.

ankostis commented, Aug 26, 2020 (1 reaction)

The above informative answer should make it into the docs, to augment the telnet console example.
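The telnet console is indeed the natural fit for the external-update case: it exposes the live engine object, so the same attribute assignments shown above work interactively. A sketch of such a session (the default port is 6023; recent Scrapy versions additionally prompt for the password printed in the crawl log):

```
$ telnet localhost 6023
>>> engine.downloader.total_concurrency = 32
>>> for slot in engine.downloader.slots.values():
...     slot.delay = 5.0
```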
