2.6.0 breaks calling multiple spiders in CrawlerProcess()


Description

Since 2.6.0, running multiple spiders from a single CrawlerProcess(), as shown in the common practices documentation, is broken:

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

Steps to Reproduce

  1. Use Scrapy >= 2.6.0.
  2. Run the following code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request


class MySpider(scrapy.Spider):
    name = 'MySpider'

    def __init__(self, url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        yield Request(url=self.url, callback=self.parse)

    def parse(self, response):
        print(response.url)


process = CrawlerProcess({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
})
# Crawl the same spider class twice, each time with a different start URL.
process.crawl(MySpider, url='https://www.google.com')
process.crawl(MySpider, url='https://www.google.co.jp')
process.start()

Expected behavior: both spiders run to completion and both start URLs are printed.

The following is the output from Scrapy 2.5.1:

2022-03-02 18:49:45 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: afe09d724aae9642
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: bd1670acfb7fb550
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com> (referer: None)
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.jp> (referer: None)
https://www.google.com
https://www.google.co.jp
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7675,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.465932,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 66477),
 'httpcompression/response_bytes': 15980,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 19,
 'memusage/max': 47611904,
 'memusage/startup': 47611904,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 600545)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 216,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7602,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.431935,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 98011),
 'httpcompression/response_bytes': 14794,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 13,
 'memusage/max': 47669248,
 'memusage/startup': 47669248,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 666076)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)

Actual behavior: the second process.crawl() call fails with twisted.internet.error.ReactorAlreadyInstalledError:

2022-03-02 18:49:12 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:12 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:12 [scrapy.extensions.telnet] INFO: Telnet Password: ce57e6aa863bb786
2022-03-02 18:49:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:13 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:13 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
Traceback (most recent call last):
  File "/home/kusanagi/work/scrapy/test.py", line 25, in <module>
    process.crawl(MySpider, url='https://www.google.co.jp')
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
    installReactor(p)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
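
The traceback shows the mechanism: the second Crawler is created with init_reactor=True and tries to install Twisted's default reactor even though the first crawl already installed one. A workaround that may help (my assumption, not a fix quoted from this thread) is to name a reactor explicitly through the TWISTED_REACTOR setting, so that Scrapy routes installation through scrapy.utils.reactor.install_reactor(), which tolerates an already-installed reactor, instead of falling back to twisted.internet.default.install():

from scrapy.crawler import CrawlerProcess

# Reusing MySpider from the reproduction script above.
# Hypothetical workaround, untested against this exact report: naming a
# reactor explicitly makes Scrapy install it via install_reactor(), which
# suppresses ReactorAlreadyInstalledError on the second crawl.
process = CrawlerProcess({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
    'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
})
process.crawl(MySpider, url='https://www.google.com')
process.crawl(MySpider, url='https://www.google.co.jp')
process.start()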

Reproduces how often: always.

Versions

Scrapy       : 2.6.1
lxml         : 4.8.0.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 22.1.0
Python       : 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
pyOpenSSL    : 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021)
cryptography : 36.0.1
Platform     : Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17

Additional context

The intention of using the same MySpider via CrawlerProcess is to call Scrapy programmatically with a different initial URL each time, plus some tweaks to the parser depending on that URL.

I think this is fair usage, and it worked fine before 2.6.0.
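
For illustration, a minimal sketch of that pattern (hypothetical code, not taken from the issue): one spider class, parameterised by its initial URL, whose parsing branches on that URL.

import scrapy
from scrapy.http import Request


class SiteSpider(scrapy.Spider):
    name = 'SiteSpider'

    def __init__(self, url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        yield Request(url=self.url, callback=self.parse)

    def parse(self, response):
        # Tweak extraction depending on which site the crawl started from.
        if 'co.jp' in self.url:
            title = response.css('title::text').get()
        else:
            title = response.xpath('//title/text()').get()
        yield {'url': response.url, 'title': title}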

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

Gallaecio commented on Mar 2, 2022 (8 reactions)

Gallaecio commented on Jul 21, 2022 (1 reaction)

There is no ETA, but we do plan on releasing it. We have a few things we want to include in 2.6.2 before release, and the maintainers who need to review them are short on time; that is why we have been delaying.
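
Until a fixed release lands, one interim approach (my suggestion, adapted from the same common-practices documentation page, not advice given in this thread) is CrawlerRunner, which leaves reactor installation to the calling code, so the reactor is installed exactly once, by the twisted.internet.reactor import:

from twisted.internet import reactor  # importing this installs the default reactor once

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
})
# Reusing MySpider from the reproduction script above.
runner.crawl(MySpider, url='https://www.google.com')
runner.crawl(MySpider, url='https://www.google.co.jp')

d = runner.join()  # a Deferred that fires when both crawls finish
d.addBoth(lambda _: reactor.stop())
reactor.run()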
