
Returning a Deferred object from open_spider in a pipeline blocks the spider


Description

The open_spider method of an item pipeline can no longer return a Deferred object in Scrapy 2.4; doing so blocks the spider. In earlier versions (2.3), this worked.

Steps to Reproduce

1. Configure the following in settings.py:

asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

2. Use the following pipeline code:

import asyncio
import scrapy
from twisted.internet.defer import Deferred

class MyPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    async def _open_spider(self, spider: scrapy.Spider):
        spider.logger.debug("async pipeline opened!")
        self.db = await connect_to_db()

    def open_spider(self, spider):
        loop = asyncio.get_event_loop()
        # asyncio.run_coroutine_threadsafe(self._open_spider(spider),loop)
        return Deferred.fromFuture(loop.create_task(self._open_spider(spider)))

3. Enable this pipeline in ITEM_PIPELINES.
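For completeness, step 3 would be the usual ITEM_PIPELINES entry in settings.py; the module path below is taken from the log output further down, while the priority value 300 is an arbitrary example:

```python
# settings.py (sketch): register the pipeline from the report.
# The priority 300 is an arbitrary example value.
ITEM_PIPELINES = {
    "scrapy_unitest.pipelines.MyPipeline": 300,
}
```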

Expected behavior: The _open_spider method executes and the spider proceeds.

Actual behavior:

This worked fine in Scrapy 2.3, but it blocks the spider in Scrapy 2.4. After emitting the output below, the spider gets stuck, produces no further output, and never closes:

2020-10-26 15:28:50 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapy_unitest)
2020-10-26 15:28:50 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-26 15:28:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-10-26 15:28:50 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2020-10-26 15:28:50 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_unitest',
 'COOKIES_ENABLED': False,
 'NEWSPIDER_MODULE': 'scrapy_unitest.spiders',
 'SPIDER_MODULES': ['scrapy_unitest.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'scrapy_unitest (+http://www.yourdomain.com)'}
2020-10-26 15:28:50 [scrapy.extensions.telnet] INFO: Telnet Password: 2aec1713f90e2ba2
2020-10-26 15:28:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-26 15:28:51 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_unitest.pipelines.MyPipeline']
2020-10-26 15:28:51 [scrapy.core.engine] INFO: Spider opened
2020-10-26 15:28:51 [asyncio] DEBUG: Using selector: SelectSelector

Reproduces how often: Every time a Deferred object is returned from the open_spider method.

Versions

Scrapy       : 2.4.0
lxml         : 4.6.1.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020)
cryptography : 3.1.1
Platform     : Windows-10-10.0.18362-SP0

Additional context

After switching back to a normal (synchronous) pipeline, the spider works again, so the problem is clearly in the pipeline.

I wonder whether there is any way to call coroutine functions from open_spider.

I tried loop.create_task and asyncio.run_coroutine_threadsafe, but neither works; they just skip over the coroutine function.
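The "skipped" coroutine can be reproduced outside Scrapy entirely. The sketch below (plain asyncio, no Scrapy involved) shows that loop.create_task() only schedules the coroutine; its body does not run until control returns to the event loop, which is exactly what never happens while the engine is blocked:

```python
import asyncio

# Standalone illustration of why loop.create_task() appears to "skip over"
# the coroutine: create_task() only schedules it, and the scheduled task
# does not run until the caller yields back to the event loop.
ran = []

async def open_db():
    ran.append("opened")

async def main():
    task = asyncio.get_running_loop().create_task(open_db())
    assert ran == []      # scheduled, but not executed yet
    await task            # yielding to the loop lets the task run
    assert ran == ["opened"]

asyncio.run(main())
```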

Will this be fixed in a future version?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

wRAR commented on Nov 7, 2020 (3 reactions):

@kmike it was implemented only for process_item, but after a very short glance I think it can be implemented for open_spider too (and probably for some other middleware/pipeline methods that are called using the same code).
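Since the maintainer comment above confirms that coroutine support exists only for process_item in 2.4, one hedged workaround sketch is to open the connection lazily from an async def process_item instead of open_spider. Nothing Scrapy-specific is needed in the class itself, so it can be exercised here with a plain event loop; fake_connect_to_db is a hypothetical stand-in for the reporter's connect_to_db:

```python
import asyncio

# Hypothetical stand-in for the connect_to_db() used in the bug report.
async def fake_connect_to_db():
    return {"connected": True}

class MyPipeline:
    db = None

    async def process_item(self, item, spider):
        # Work around the open_spider limitation: open the connection
        # lazily when the first item arrives, inside the one pipeline
        # method that Scrapy 2.4 can call as a coroutine.
        if self.db is None:
            self.db = await fake_connect_to_db()
        return item

pipeline = MyPipeline()
item = asyncio.run(pipeline.process_item({"url": "http://example.com"}, None))
```

This defers the connection cost to the first item rather than spider startup, which may or may not be acceptable depending on the use case.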

kmike commented on Nov 7, 2020 (3 reactions):

It would be nice to support async def open_spider directly. @wRAR, was it just not implemented, or is there something more fundamental here?
