
Is it possible to close the spider at the spider_opened signal?

See original GitHub issue

Hello,

I’m working on a middleware that loads some resources in its spider_opened handler. If those resources can’t be loaded, I need the spider to be closed.

I tried to do that by raising the CloseSpider exception and also by calling crawler.engine.close_spider(...), but neither of them works.
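
Roughly, what I tried looks like this (a minimal sketch; resources_loaded() is a hypothetical stand-in for the real loading logic):

from scrapy import signals
from scrapy.exceptions import CloseSpider


class ResourceMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        middleware.crawler = crawler
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def resources_loaded(self):
        # Hypothetical stand-in for the real loading step.
        return False

    def spider_opened(self, spider):
        if not self.resources_loaded():
            # Attempt 1: raising CloseSpider (swallowed, as shown further down).
            raise CloseSpider('resources missing')
            # Attempt 2: calling the engine directly (also doesn't work here).
            # self.crawler.engine.close_spider(spider, 'resources missing')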

Is there a way to do that?

Thanks!

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

2 reactions
wRAR commented, Sep 29, 2018

@pauloromeira more likely because of the Scrapy version. Anyway, the situation where you can’t stop a thing that hasn’t fully started yet sounds quite common in programming; I don’t know what the proper solutions are.
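
One candidate pattern for that situation, sketched here under the assumption that a call deferred by one reactor iteration lands after engine startup completes (nothing in this thread confirms the timing, and it may depend on the Scrapy version), is to schedule the stop rather than perform it inside the handler:

from scrapy import signals
from twisted.internet import reactor


class DeferredCloseMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        middleware.crawler = crawler
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        # Closing directly here runs while the engine is still starting up.
        # callLater(0, ...) pushes the call to the next reactor iteration,
        # i.e. (hopefully) after startup has finished.
        reactor.callLater(0, self.crawler.engine.close_spider, spider, 'startup failed')

signals.connect, reactor.callLater and engine.close_spider are all existing Scrapy/Twisted APIs; only the timing assumption and the 'startup failed' reason string are illustrative.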

1 reaction
AntonGsv commented, Apr 13, 2021

@wRAR spider:

from scrapy import signals, Spider, Request
from scrapy.exceptions import CloseSpider


class CustomDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info('before CloseSpider')
        # Raised from a signal handler, this exception never reaches the
        # engine: the dispatcher catches and logs it (see the logs below).
        raise CloseSpider("reason")


class CustomSpider(Spider):
    name = 'custom'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Register the middleware above by its import path.
            ".".join([CustomDownloaderMiddleware.__module__, CustomDownloaderMiddleware.__name__]): 1,
        }
    }

    def start_requests(self):
        yield Request('https://api.myip.com/')

    def parse(self, response, **kwargs):
        # Still runs, showing the spider was never closed.
        self.logger.info(f'{self.__class__.__name__}.parse')

logs:

scrapy crawl custom
2021-04-13 17:46:36 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-04-13 17:46:36 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-04-13 17:46:36 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-04-13 17:46:36 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_MODULES': ['app.spiders.test']}
2021-04-13 17:46:36 [scrapy.extensions.telnet] INFO: Telnet Password: eb324b745c4d2fa0
2021-04-13 17:46:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-04-13 17:46:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['app.spiders.test.custom.CustomDownloaderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-04-13 17:46:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-04-13 17:46:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-04-13 17:46:37 [scrapy.core.engine] INFO: Spider opened
2021-04-13 17:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-04-13 17:46:37 [custom] INFO: before CloseSpider
2021-04-13 17:46:37 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method CustomDownloaderMiddleware.spider_opened of <app.spiders.test.custom.CustomDownloaderMiddleware object at 0x0000029DA879C7F0>>
Traceback (most recent call last):
  File "d:\host\www\shelf-conscious-python-scrapers\src\.venv\lib\site-packages\scrapy\utils\defer.py", line 157, in maybeDeferred_coro
    result = f(*args, **kw)
  File "d:\host\www\shelf-conscious-python-scrapers\src\.venv\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "D:\host\www\shelf-conscious-python-scrapers\src\app\spiders\test\custom.py", line 14, in spider_opened
    raise CloseSpider("reason")
scrapy.exceptions.CloseSpider
2021-04-13 17:46:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-04-13 17:46:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.myip.com/> (referer: None)
2021-04-13 17:46:37 [custom] INFO: CustomSpider.parse
2021-04-13 17:46:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-04-13 17:46:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1215,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.717408,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 4, 13, 14, 46, 37, 751142),
 'httpcompression/response_bytes': 52,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 4, 13, 14, 46, 37, 33734)}
2021-04-13 17:46:37 [scrapy.core.engine] INFO: Spider closed (finished)
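
The traceback above shows exactly where the exception goes: a CloseSpider raised inside a signal handler is caught by Scrapy’s signal dispatcher (scrapy.utils.signal), logged as “Error caught on signal handler”, and discarded, so the request is still downloaded, parse() still runs, and the crawl ends with finish_reason 'finished'.

A workaround that avoids engine internals altogether (a sketch, not from this thread: load_resources() and the spider are hypothetical, and the middleware is assumed to be enabled in DOWNLOADER_MIDDLEWARES) is to record the failure on the spider and have start_requests yield nothing:

from scrapy import signals, Spider, Request


def load_resources():
    # Hypothetical stand-in for the real loading step.
    return False


class ResourceCheckMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        # spider_opened is sent before the start_requests generator is
        # consumed, so the flag is set in time for the check below.
        spider.resources_ok = load_resources()


class GuardedSpider(Spider):
    name = 'guarded'

    def start_requests(self):
        if not getattr(self, 'resources_ok', False):
            self.logger.error('resources failed to load; not starting')
            return  # no requests -> the spider closes on its own
        yield Request('https://example.com')

This works because a spider that yields no requests simply finishes immediately, with no need to call into the engine at all.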
Read more comments on GitHub >

Top Results From Across the Web

Signals — Scrapy 2.7.1 documentation
If it was closed because the spider has completed scraping, the reason is 'finished' . Otherwise, if the spider was manually closed by...
Read more >
scrapy: Call a function when a spider quits - Stack Overflow
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal. Scrapy Doc : scrapy.spiders.
Read more >
Connecting to spider_closed signal inside the spider... is it safe?
I have a spider that keeps track of the urls scarped from each page that it visits. When the scraping is complete, I...
Read more >
