2.6.0 breaks calling multiple spiders in CrawlerProcess()
Description
Since 2.6.0, calling multiple spiders from CrawlerProcess(), as shown in the common practices documentation, is broken:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Steps to Reproduce
- use Scrapy >= 2.6.0
- run the following code to reproduce the issue:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request


class MySpider(scrapy.Spider):
    name = 'MySpider'

    def __init__(self, url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        yield Request(url=self.url, callback=self.parse)

    def parse(self, response):
        print(response.url)


process = CrawlerProcess({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
})
process.crawl(MySpider, url='https://www.google.com')
process.crawl(MySpider, url='https://www.google.co.jp')
process.start()
Expected behavior:
The following is the result from Scrapy 2.5.1:
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: afe09d724aae9642
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: bd1670acfb7fb550
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com> (referer: None)
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.jp> (referer: None)
https://www.google.com
https://www.google.co.jp
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7675,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.465932,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 66477),
'httpcompression/response_bytes': 15980,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 19,
'memusage/max': 47611904,
'memusage/startup': 47611904,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 600545)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 216,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7602,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.431935,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 98011),
'httpcompression/response_bytes': 14794,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 13,
'memusage/max': 47669248,
'memusage/startup': 47669248,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 666076)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
Actual behavior:
The spider fails with twisted.internet.error.ReactorAlreadyInstalledError:
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:12 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:12 [scrapy.extensions.telnet] INFO: Telnet Password: ce57e6aa863bb786
2022-03-02 18:49:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:13 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:13 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
Traceback (most recent call last):
File "/home/kusanagi/work/scrapy/test.py", line 25, in <module>
process.crawl(MySpider, url='https://www.google.co.jp')
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
return Crawler(spidercls, self.settings, init_reactor=True)
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
default.install()
File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
installReactor(p)
File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
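As the traceback shows, each spider passed to process.crawl() gets its own Crawler, and since 2.6.0 the Crawler constructor (called with init_reactor=True) tries to install the Twisted reactor; Twisted permits only one reactor installation per process, so the second process.crawl() call fails. A minimal standalone sketch of that Twisted behavior (my illustration, not code from the report; epollreactor is Linux-only, matching the platform in the log):

from twisted.internet import epollreactor  # Linux-only reactor, as in the traceback

epollreactor.install()  # first install succeeds (first Crawler)
epollreactor.install()  # raises ReactorAlreadyInstalledError (second Crawler)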
Reproduces how often:
Always.
Versions
Scrapy       : 2.6.1
lxml         : 4.8.0.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 22.1.0
Python       : 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
pyOpenSSL    : 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021)
cryptography : 36.0.1
Platform     : Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
Additional context
The intention behind using the same MySpider through CrawlerProcess is to call Scrapy programmatically with different initial URLs, applying some tweaks to the parser depending on the initial URL.
I think this is fair usage, and it was working fine before 2.6.0.
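Until a fixed release ships, a possible workaround is to manage the reactor yourself with CrawlerRunner, following the CrawlerRunner example from the same practices page linked above (reusing the MySpider class defined earlier). This is a sketch under the assumption that CrawlerRunner, which does not install a reactor per crawler, is unaffected; I have not verified it against 2.6.1:

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
})
# CrawlerRunner leaves reactor management to the caller, so no
# per-crawler reactor installation takes place.
runner.crawl(MySpider, url='https://www.google.com')
runner.crawl(MySpider, url='https://www.google.co.jp')
d = runner.join()                    # Deferred that fires when both crawls finish
d.addBoth(lambda _: reactor.stop())  # stop the reactor once they do
reactor.run()                        # blocks until reactor.stop() is called

Downgrading to Scrapy 2.5.1 also avoids the problem, as the expected-behavior log above shows.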
Top GitHub Comments
I have identified https://github.com/scrapy/scrapy/commit/60c8838554a79e70c22a7c6a57baedfcaf521444 as the cause (things work with its parent commit, https://github.com/scrapy/scrapy/commit/46ef9cf771789f1db513bbf2f65243d3320ce695). Working on a fix.
There is no ETA, but we do plan on releasing it. We have a few things we want to include in 2.6.2 before the release, and the maintainers who need to review them are short on time; that is why we have been delaying.