process_spider_exception not called with exception from spider

According to the documentation, process_spider_exception should also be called when a spider throws an exception. To my understanding, this would include throwing an exception from any parse method like this:

    def parse_item(self, response):
        log.msg("[parse_item] Now in exceptional parse", level=log.INFO)
        raise Exception('foo')

My middleware looks like this:

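from scrapy import log  # legacy Scrapy logging API, as used throughout this report


class ManyExceptionsMiddleware(object):
    def process_spider_output(self, response, result, spider):
        # Pass the result through unchanged; the log line only proves the middleware is enabled.
        log.msg("[process_spider_output] Shows that middleware IS installed", level=log.INFO)
        return result

    def process_spider_exception(self, response, exception, spider):
        # Log the failure and return an empty result so the exception counts as handled.
        log.msg("[process_spider_exception] Many exceptions on %s" % spider.name, level=log.WARNING)
        return []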
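
For anyone reading this on a newer Scrapy: scrapy.log was deprecated in Scrapy 1.0 in favour of the standard logging module. Below is a minimal sketch of the same two hooks written against logging. The hook signatures match the middleware above; fake_spider and the direct calls at the bottom are hypothetical stand-ins so the class can be exercised without a running crawl, and the settings entry in the comment uses a made-up module path.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class ManyExceptionsMiddleware:
    """Same two hooks as the middleware above, logging via the standard library."""

    def process_spider_output(self, response, result, spider):
        # Pass the callback's result through unchanged.
        logger.info("[process_spider_output] Shows that middleware IS installed")
        return result

    def process_spider_exception(self, response, exception, spider):
        # Returning an iterable (here an empty list) marks the exception as handled.
        logger.warning("[process_spider_exception] Many exceptions on %s", spider.name)
        return []


# Hypothetical settings.py entry (module path is an assumption):
# SPIDER_MIDDLEWARES = {"myproject.middlewares.ManyExceptionsMiddleware": 543}

# Hypothetical stand-ins: call the hooks directly, outside of any crawl.
fake_spider = SimpleNamespace(name="example")
middleware = ManyExceptionsMiddleware()
passed_through = middleware.process_spider_output(None, ["item"], fake_spider)
handled = middleware.process_spider_exception(None, Exception("foo"), fake_spider)
```

Calling the hooks by hand like this only checks the middleware's own logic; whether Scrapy actually invokes process_spider_exception for a given callback is the question this issue is about.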

This results in:

2015-01-18 18:08:01+0100 [example] DEBUG: Crawled (200) <GET some-secret-url> (referer: some-other-url)
2015-01-18 18:08:01+0100 [scrapy] INFO: [process_spider_output] Shows that middleware IS installed
2015-01-18 18:08:01+0100 [scrapy] INFO: [parse_item] Now in exceptional parse
2015-01-18 18:08:01+0100 [example] ERROR: Spider error processing <GET some-secret-url>
    Traceback (most recent call last):
[...]
    exceptions.Exception: foo

I then added the following method to the middleware to check that process_spider_exception works at all (as far as I can tell, this is the only way exceptions are handled in Scrapy itself):

def process_spider_input(self, response, spider):
    raise Exception('foo')

Then the output looks like this:

2015-01-18 18:09:53+0100 [example] DEBUG: Crawled (200) <GET some-secret-url> (referer: None)
2015-01-18 18:09:53+0100 [scrapy] WARNING: [process_spider_exception] Many exceptions on some-secret-domain
2015-01-18 18:09:53+0100 [scrapy] INFO: [process_spider_output] Shows that middleware IS installed

If you could tell me where this should happen, I could look into the code and try to fix it (if I understand it well enough).

Top GitHub Comments

elacuesta commented, Jun 26, 2019

Hello @ccc-larc, by adding those yield statements you are turning the parsing method into a generator, which makes the spider fall under the scope of #220. A fix was merged (#2061) but has not been released yet; it will be included in the next version. This is the output I get when running your code with the current master branch (c81d120b). Note that the item that was produced before the exception is processed normally.

2019-06-26 10:48:49 [scrapy.core.engine] INFO: Spider opened
2019-06-26 10:48:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-26 10:48:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-26 10:48:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2019-06-26 10:48:49 [root] INFO: [process_spider_output] Shows that middleware is installed
2019-06-26 10:48:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org>
{'an': 'item'}
2019-06-26 10:48:49 [root] INFO: [parse] Now in exceptional parse
2019-06-26 10:48:49 [root] WARNING: [process_spider_exception] Exception caught: from parse
2019-06-26 10:48:49 [scrapy.core.engine] INFO: Closing spider (finished)
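
To illustrate why a generator callback behaves differently, here is a plain-Python sketch with no Scrapy involved. The names parse and consume are hypothetical stand-ins for a spider callback and for whatever code iterates over its result; this is not how Scrapy is implemented, only a demonstration of the language behaviour. Calling a generator function runs none of its body, so the exception only surfaces later, while the result is being iterated, and anything yielded before it has already been delivered.

```python
def parse(response):
    """Hypothetical callback: yields one item, then fails."""
    yield {"an": "item"}
    raise Exception("from parse")


def consume(result):
    """Iterate over a callback result, keeping items seen before any error."""
    items, error = [], None
    try:
        for item in result:
            items.append(item)
    except Exception as exc:
        # The exception is raised here, during iteration, not at call time.
        error = exc
    return items, error


result = parse(response=None)  # no exception yet: the body has not started running
items, error = consume(result)
print(items, repr(error))
```

The item yielded before the raise ends up in items, which mirrors the log above where {'an': 'item'} is scraped before the exception is reported.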

kmike commented, Jul 4, 2019

Closing it as fixed by https://github.com/scrapy/scrapy/pull/2061.