TWISTED_REACTOR setting not honored from Spider.custom_settings
Description
The value of the TWISTED_REACTOR setting is not taken into account when the setting is specified in a spider’s custom_settings attribute.
It works as expected when the setting is specified in a project’s settings file, passed as a parameter when creating a CrawlerProcess object (as the tests show), or given as a CLI argument with the -s option.
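For reference, this is a minimal sketch of one of the working variants described above: passing the setting when creating the CrawlerProcess object. The script assumes the AsyncIOSpider class from the reproduction file further down (asyncio_spider.py) and is only meant to contrast with the failing custom_settings case.

# run_with_process.py -- minimal sketch of the working variant: the reactor
# setting is passed to CrawlerProcess instead of Spider.custom_settings.
from scrapy.crawler import CrawlerProcess

from asyncio_spider import AsyncIOSpider  # the spider defined in the steps below

process = CrawlerProcess(settings={
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
})
process.crawl(AsyncIOSpider)
process.start()  # the asyncio reactor is installed before crawling starts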
Steps to Reproduce
- Create a file with the following contents:

# asyncio_spider.py
import asyncio

from scrapy import Spider


class AsyncIOSpider(Spider):
    name = "asyncio"
    start_urls = ["https://example.org"]
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    async def parse(self, response):
        await asyncio.sleep(1)
        yield {"foo": "bar"}
- Execute the spider:
scrapy runspider asyncio_spider.py
Expected behavior: The spider should run fine, without exceptions
Actual behavior: The following exception is raised:
2020-04-12 00:04:23 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.org> (referer: None)
Traceback (most recent call last):
File "/.../scrapy/venv-scrapy/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/.../scrapy/scrapy/utils/py36.py", line 8, in collect_asyncgen
async for x in result:
File "/.../scrapy/test-spiders/asyncio_spider.py", line 14, in parse
await asyncio.sleep(1)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/tasks.py", line 595, in sleep
return await future
RuntimeError: await wasn't used with future
This is because the asyncio-based reactor is not actually installed, as the third log line of the job shows:
2020-04-12 00:04:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Full logs:
2020-04-12 00:04:23 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-12 00:04:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-04-12 00:04:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-12 00:04:23 [scrapy.crawler] INFO: Overridden settings:
{'EDITOR': 'nano',
'SPIDER_LOADER_WARN_ONLY': True,
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2020-04-12 00:04:23 [scrapy.extensions.telnet] INFO: Telnet Password: a2bd966307b319f0
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-12 00:04:23 [scrapy.core.engine] INFO: Spider opened
2020-04-12 00:04:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-12 00:04:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-12 00:04:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2020-04-12 00:04:23 [asyncio] DEBUG: Using selector: KqueueSelector
2020-04-12 00:04:23 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.org> (referer: None)
Traceback (most recent call last):
File "/.../scrapy/venv-scrapy/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/.../scrapy/scrapy/utils/py36.py", line 8, in collect_asyncgen
async for x in result:
File "/.../scrapy/test-spiders/asyncio_spider.py", line 14, in parse
await asyncio.sleep(1)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/tasks.py", line 595, in sleep
return await future
RuntimeError: await wasn't used with future
2020-04-12 00:04:23 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-12 00:04:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1001,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.778025,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 12, 3, 4, 23, 925629),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 54767616,
'memusage/startup': 54767616,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/RuntimeError': 1,
'start_time': datetime.datetime(2020, 4, 12, 3, 4, 23, 147604)}
2020-04-12 00:04:23 [scrapy.core.engine] INFO: Spider closed (finished)
Reproduces how often: 100% of the time.
Versions
Scrapy : 2.0.1
lxml : 4.5.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 20.3.0
Python : 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : Darwin-18.7.0-x86_64-i386-64bit

The issue was also reproduced with these versions:

Scrapy : 2.0.1
lxml : 4.5.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.6.9 (default, Nov 7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : Linux-4.15.0-96-generic-x86_64-with-LinuxMint-19.1-tessa
Additional context
From a quick look, it seems like this happens because the custom reactor is installed in CrawlerRunner.__init__, but the spider class is only available in CrawlerRunner.crawl or CrawlerRunner.create_crawler.
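To make the ordering issue concrete, here is a simplified, hypothetical model of the sequence described above (not actual Scrapy code; names like CrawlerRunnerModel are invented for illustration):

# ordering_sketch.py -- hypothetical, self-contained model of the ordering
# problem; this is NOT actual Scrapy code.

def install_reactor(path):
    # Stand-in for the real reactor installation.
    print(f"reactor installed: {path}")


class AsyncIOSpider:
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }


class CrawlerRunnerModel:
    def __init__(self, settings):
        # The reactor is chosen and installed here, before any spider class is
        # known, so only project/CLI-level settings can influence it.
        install_reactor(settings.get(
            "TWISTED_REACTOR", "twisted.internet.selectreactor.SelectReactor"))

    def crawl(self, spidercls):
        # Spider.custom_settings is only merged here -- after the (default)
        # reactor has already been installed, hence too late to change it.
        print("custom_settings seen only now:", spidercls.custom_settings)


runner = CrawlerRunnerModel(settings={})  # default reactor gets installed here
runner.crawl(AsyncIOSpider)               # TWISTED_REACTOR arrives too late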
Top GitHub Comments
I think we should only support matching settings: take custom_settings from the first spider, and show an error if any subsequent spider has different custom_settings.
Running multiple spiders in the same process
As far as I know, running multiple spiders in the same process means that all of them share a single reactor instance (one reactor per process). As a result, if a CrawlerProcess runs spiders with different reactor settings, it is not clear what the expected outcome should be.
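To make that ambiguity concrete, here is a hypothetical sketch (spider names and settings invented for illustration): two spiders asking for different reactors inside one CrawlerProcess, where only one reactor can ever exist.

# conflict_sketch.py -- hypothetical illustration of the ambiguity described
# above: two spiders in one CrawlerProcess asking for different reactors.
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class AsyncioSpider(Spider):
    name = "wants_asyncio"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }


class SelectSpider(Spider):
    name = "wants_select"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.selectreactor.SelectReactor",
    }


process = CrawlerProcess()
process.crawl(AsyncioSpider)   # asks for the asyncio reactor
process.crawl(SelectSpider)    # asks for the default select reactor
process.start()                # a single reactor runs both crawls -- which setting wins?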