2.6.0 breaks calling multiple spiders in CrawlerProcess()
Description
Since 2.6.0, calling multiple spiders from CrawlerProcess(), as shown in the common practices documentation, is broken:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Steps to Reproduce
- use Scrapy >= 2.6.0
- run the following code to reproduce the issue:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request


class MySpider(scrapy.Spider):
    name = 'MySpider'

    def __init__(self, url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        yield Request(url=self.url, callback=self.parse)

    def parse(self, response):
        print(response.url)


process = CrawlerProcess({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
})
process.crawl(MySpider, url='https://www.google.com')
process.crawl(MySpider, url='https://www.google.co.jp')
process.start()
Expected behavior:
The following is the result from Scrapy 2.5.1:
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-02 18:49:45 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: afe09d724aae9642
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:45 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet Password: bd1670acfb7fb550
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:45 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com> (referer: None)
2022-03-02 18:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.jp> (referer: None)
https://www.google.com
https://www.google.co.jp
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7675,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.465932,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 66477),
'httpcompression/response_bytes': 15980,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 19,
'memusage/max': 47611904,
'memusage/startup': 47611904,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 600545)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-02 18:49:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 216,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7602,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.431935,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 2, 9, 49, 46, 98011),
'httpcompression/response_bytes': 14794,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 13,
'memusage/max': 47669248,
'memusage/startup': 47669248,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 3, 2, 9, 49, 45, 666076)}
2022-03-02 18:49:46 [scrapy.core.engine] INFO: Spider closed (finished)
Actual behavior:
The spider fails with twisted.internet.error.ReactorAlreadyInstalledError:
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-02 18:49:12 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
2022-03-02 18:49:12 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
2022-03-02 18:49:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-02 18:49:12 [scrapy.extensions.telnet] INFO: Telnet Password: ce57e6aa863bb786
2022-03-02 18:49:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-02 18:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-02 18:49:13 [scrapy.core.engine] INFO: Spider opened
2022-03-02 18:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-02 18:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-02 18:49:13 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DEPTH_PRIORITY': 1}
Traceback (most recent call last):
File "/home/kusanagi/work/scrapy/test.py", line 25, in <module>
process.crawl(MySpider, url='https://www.google.co.jp')
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
return Crawler(spidercls, self.settings, init_reactor=True)
File "/usr/local/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
default.install()
File "/usr/local/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 256, in install
installReactor(p)
File "/usr/local/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
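As the traceback shows, each spider passed to process.crawl() gets its own Crawler, and since 2.6.0 the Crawler constructor (called with init_reactor=True) tries to install the Twisted reactor; Twisted permits only one reactor installation per process, so the second process.crawl() call fails. A minimal standalone sketch of that Twisted behavior (my illustration, not code from the report; epollreactor is Linux-only, matching the platform in the log):

from twisted.internet import epollreactor  # Linux-only reactor, as in the traceback

epollreactor.install()  # first install succeeds (first Crawler)
epollreactor.install()  # raises ReactorAlreadyInstalledError (second Crawler)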
Reproduces how often:
Always.
Versions
Scrapy       : 2.6.1
lxml         : 4.8.0.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 22.1.0
Python       : 3.9.10 (main, Jan 17 2022, 08:36:28) - [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
pyOpenSSL    : 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021)
cryptography : 36.0.1
Platform     : Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
Additional context
The intention behind using the same MySpider through CrawlerProcess is to call Scrapy programmatically with different initial URLs, applying some tweaks to the parser depending on the initial URL.
I think this is fair usage, and it was working fine before 2.6.0.
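Until a fixed release ships, a possible workaround is to manage the reactor yourself with CrawlerRunner, following the CrawlerRunner example from the same practices page linked above (reusing the MySpider class defined earlier). This is a sketch under the assumption that CrawlerRunner, which does not install a reactor per crawler, is unaffected; I have not verified it against 2.6.1:

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'DEPTH_LIMIT': 1,
    'DEPTH_PRIORITY': 1,
})
# CrawlerRunner leaves reactor management to the caller, so no
# per-crawler reactor installation takes place.
runner.crawl(MySpider, url='https://www.google.com')
runner.crawl(MySpider, url='https://www.google.co.jp')
d = runner.join()                    # Deferred that fires when both crawls finish
d.addBoth(lambda _: reactor.stop())  # stop the reactor once they do
reactor.run()                        # blocks until reactor.stop() is called

Downgrading to Scrapy 2.5.1 also avoids the problem, as the expected-behavior log above shows.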
Top GitHub Comments
I have identified https://github.com/scrapy/scrapy/commit/60c8838554a79e70c22a7c6a57baedfcaf521444 as the cause (things work with its parent commit, https://github.com/scrapy/scrapy/commit/46ef9cf771789f1db513bbf2f65243d3320ce695). Working on a fix.
There is no ETA, but we do plan on releasing it. We have a few things we want to include in 2.6.2 before the release, and the maintainers who need to review them are short on time; that is why we have been delaying.