
TWISTED_REACTOR setting not honored from Spider.custom_settings

See original GitHub issue

Description

The value of the TWISTED_REACTOR setting is not taken into account if the setting is specified in a spider’s custom_settings attribute. It works correctly if the setting is specified in a project’s settings file, as a parameter when creating a CrawlerProcess object (as the tests show), or as a CLI argument with the -s option.
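
For reference, here is a minimal sketch (not taken from the issue; the spider is just an illustrative stand-in) of one of the cases that does work: passing TWISTED_REACTOR as a parameter when creating the CrawlerProcess object.

# working_case.py - illustrative sketch, assuming Scrapy 2.0.x
import asyncio

from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://example.org"]

    async def parse(self, response):
        await asyncio.sleep(1)
        yield {"foo": "bar"}


# The setting is already known when the process object is constructed, so the
# asyncio reactor gets installed and the async parse() works.
process = CrawlerProcess(settings={
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
})
process.crawl(ExampleSpider)
process.start()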

Steps to Reproduce

  1. Create a file with the following contents:
# asyncio_spider.py
import asyncio

from scrapy import Spider


class AsyncIOSpider(Spider):
    name = "asyncio"
    start_urls = ["https://example.org"]
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    async def parse(self, response):
        await asyncio.sleep(1)
        yield {"foo": "bar"}
  2. Execute the spider: scrapy runspider asyncio_spider.py

Expected behavior: The spider should run without exceptions

Actual behavior: The following exception is raised:

2020-04-12 00:04:23 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.org> (referer: None)
Traceback (most recent call last):
  File "/.../scrapy/venv-scrapy/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/.../scrapy/scrapy/utils/py36.py", line 8, in collect_asyncgen
    async for x in result:
  File "/.../scrapy/test-spiders/asyncio_spider.py", line 14, in parse
    await asyncio.sleep(1)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/tasks.py", line 595, in sleep
    return await future
RuntimeError: await wasn't used with future

This is because the asyncio-based reactor is not actually installed, as the third log line of the job shows:

2020-04-12 00:04:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
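
As a quick sanity check (a sketch, not part of the original report), one can print the reactor that is actually installed; with the default reactor this shows SelectReactor (or EPollReactor on Linux) rather than AsyncioSelectorReactor:

# reactor_check.py - illustrative sketch
from twisted.internet import reactor  # installs the default reactor if none is installed yet

print(type(reactor).__module__, type(reactor).__name__)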

Full logs:

2020-04-12 00:04:23 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-12 00:04:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-04-12 00:04:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-12 00:04:23 [scrapy.crawler] INFO: Overridden settings:
{'EDITOR': 'nano',
 'SPIDER_LOADER_WARN_ONLY': True,
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2020-04-12 00:04:23 [scrapy.extensions.telnet] INFO: Telnet Password: a2bd966307b319f0
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-12 00:04:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-12 00:04:23 [scrapy.core.engine] INFO: Spider opened
2020-04-12 00:04:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-12 00:04:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-12 00:04:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2020-04-12 00:04:23 [asyncio] DEBUG: Using selector: KqueueSelector
2020-04-12 00:04:23 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.org> (referer: None)
Traceback (most recent call last):
  File "/.../scrapy/venv-scrapy/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/.../scrapy/scrapy/utils/py36.py", line 8, in collect_asyncgen
    async for x in result:
  File "/.../scrapy/test-spiders/asyncio_spider.py", line 14, in parse
    await asyncio.sleep(1)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/tasks.py", line 595, in sleep
    return await future
RuntimeError: await wasn't used with future
2020-04-12 00:04:23 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-12 00:04:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1001,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.778025,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 4, 12, 3, 4, 23, 925629),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'memusage/max': 54767616,
 'memusage/startup': 54767616,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/RuntimeError': 1,
 'start_time': datetime.datetime(2020, 4, 12, 3, 4, 23, 147604)}
2020-04-12 00:04:23 [scrapy.core.engine] INFO: Spider closed (finished)

Reproduces how often: 100% of the time.

Versions

Scrapy       : 2.0.1
lxml         : 4.5.0.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Darwin-18.7.0-x86_64-i386-64bit

Scrapy       : 2.0.1
lxml         : 4.5.0.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Linux-4.15.0-96-generic-x86_64-with-LinuxMint-19.1-tessa

Additional context

From a quick look, it seems like this happens because the custom reactor is installed in CrawlerRunner.__init__, but the spider class is only available in CrawlerRunner.crawl or CrawlerRunner.create_crawler.
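
If that ordering is correct, the problem boils down to the following (a rough illustration, not Scrapy source; asyncio_spider is the file from the reproduction steps above):

from scrapy.crawler import CrawlerProcess
from asyncio_spider import AsyncIOSpider  # spider from the reproduction steps above

process = CrawlerProcess()    # the reactor is chosen and installed here, from project/CLI settings only
process.crawl(AsyncIOSpider)  # custom_settings is first seen here, after the reactor already exists
process.start()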

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 7
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

1 reaction
wRAR commented, Dec 8, 2021

I think we should only support matching settings and take custom_settings from the first spider, while showing an error if the following spiders have different custom_settings.
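
A rough illustration of that proposal (purely hypothetical code, not a patch):

# Take TWISTED_REACTOR from the first spider that sets it and complain if a
# later spider asks for a different reactor.
def resolve_reactor(spider_classes, default=None):
    chosen = default
    for cls in spider_classes:
        wanted = (getattr(cls, "custom_settings", None) or {}).get("TWISTED_REACTOR")
        if wanted is None:
            continue
        if chosen is None:
            chosen = wanted
        elif wanted != chosen:
            raise ValueError(
                f"{cls.__name__} requests reactor {wanted!r}, "
                f"but {chosen!r} was already selected"
            )
    return chosen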

1 reaction
GeorgeA92 commented, Dec 4, 2021

Running multiple spiders in the same process

...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()

As far as I know, running multiple spiders in the same process also means that those spiders share a single reactor instance (one reactor per process).

As a result, if a CrawlerProcess contains spiders with different reactor settings, it is not clear what the expected outcome should be:

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2_with_TWISTED_REACTOR_in_custom_settings)
process.start()

