
Multiple execution of process_spider_exception of the same spider middleware method for single exception


Description

I don't know if this should be considered a bug, but at the very least the behaviour should be the same whether a spider callback only raises an exception or raises an exception while also yielding.

When a spider callback raises an exception while also yielding an item or request, and no middleware catches the exception by returning an iterable, the sequence of process_spider_exception methods is executed multiple times. Actual behaviour with n enabled middlewares that all return None for this exception: mw_n.process_spider_exception -> mw_(n-1).process_spider_exception -> mw_(n-2).process_spider_exception -> … -> mw_1.process_spider_exception -> mw_(n-1).process_spider_exception -> mw_(n-2).process_spider_exception -> … -> mw_1.process_spider_exception -> …
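
A minimal, self-contained model of this behaviour (all names are hypothetical; this is not Scrapy's code): each middleware wraps the spider output in a generator that, on error, runs the exception-handler chain from its own position and then re-raises, so every enclosing wrapper re-runs the tail of the chain.

def spider_callback():
    raise Exception('fake exception')
    yield {}  # the yield makes this a generator, so the raise happens on iteration


def run_handlers(exception, start_index, names):
    # Models the process_spider_exception chain when every handler returns None.
    for name in names[start_index:]:
        print(f'{name}.process_spider_exception({exception})')


def wrap(iterable, index, names):
    # Models the per-middleware wrapper around each output iterable.
    try:
        yield from iterable
    except Exception as exc:
        run_handlers(exc, index, names)
        raise  # no handler recovered, so the exception keeps propagating outwards


# Spider-to-engine order, i.e. the innermost handler fires first.
names = ['LastMiddleware', 'InTheMiddleMiddleware', 'FirstMiddleware']
result = spider_callback()
for i in range(len(names)):
    result = wrap(result, i, names)

try:
    list(result)  # drive the iteration: prints the full chain, then its tails
except Exception:
    pass

With three wrappers this prints six handler calls: the full chain once, then its shrinking tails. The real log below shows even more repetitions, presumably because the built-in middlewares sitting between the custom ones add wrapper layers of their own.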

Steps to Reproduce

  1. Create a clean project.
  2. Add the sample spider from the additional context section.
  3. Run with a spider callback that raises an exception and yields {} (or any request).
  4. Run with a spider callback that only raises an exception.
  5. Compare the number of log lines produced by the middlewares' process_spider_exception methods.

Expected behavior: for any implementation of the spider callback, the process_spider_exception sequence is executed once.

Actual behavior: the process_spider_exception sequence is executed multiple times if the callback raises an exception and also yields an item, but only once if it merely raises the exception.
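
The difference between the two callback forms comes down to plain Python generator semantics: a callback without yield raises as soon as it is called, while a callback containing yield returns a generator and raises only when that generator is iterated, which is what sends the exception through each nested middleware wrapper in turn. A quick illustration (hypothetical function names):

def plain_callback(response=None):
    raise Exception('fake exception')


def generator_callback(response=None):
    raise Exception('fake exception')
    yield {}  # never reached, but its presence makes this a generator function


try:
    plain_callback()        # raises immediately, at call time
except Exception as exc:
    print('plain:', exc)

gen = generator_callback()  # no exception yet: the call only builds a generator
try:
    next(gen)               # the raise happens here, during iteration
except Exception as exc:
    print('generator:', exc)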

Reproduces how often: 100%

Versions

Scrapy       : 2.3.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.2 (default, Feb 26 2020, 14:58:38) - [GCC 8.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 3.0
Platform     : Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5

Additional context

Possibly related to https://github.com/scrapy/scrapy/pull/4272 and https://github.com/scrapy/scrapy/issues/4266

Spider to reproduce (clean Scrapy install + scrapy startproject tutorial):

import logging
import scrapy


class FirstMiddleware:
    def process_spider_output(self, response, result, spider):
        logging.info(f'FirstMiddleware - process_spider_output {result}')
        yield from result

    def process_spider_exception(self, response, exception, spider):
        logging.warning(f'FirstMiddleware - process_spider_exception {exception}')
        return None  # None means: not handled, pass the exception to the next middleware


class LastMiddleware:
    def process_spider_output(self, response, result, spider):
        logging.info(f'LastMiddleware - process_spider_output {result}')
        yield from result

    def process_spider_exception(self, response, exception, spider):
        logging.warning(f'LastMiddleware - process_spider_exception {exception}')
        return None


class InTheMiddleMiddleware:
    def process_spider_output(self, response, result, spider):
        logging.info(f'InTheMiddleMiddleware - process_spider_output {result}')
        yield from result

    def process_spider_exception(self, response, exception, spider):
        logging.warning(f'InTheMiddleMiddleware - process_spider_exception {exception}')
        return None


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'SPIDER_MIDDLEWARES': {
            __name__ + '.FirstMiddleware': 10,
            __name__ + '.LastMiddleware': 890,
            __name__ + '.InTheMiddleMiddleware': 400,
        }
    }

    def parse(self, response):
        raise Exception('fake exception')
        # The unreachable yield below still makes parse a generator function,
        # so the exception is only raised while the output chain iterates it.
        yield {}
        # yield scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse)

Output log:

scrapy crawl quotes
2020-08-15 15:16:04 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: tutorial)
2020-08-15 15:16:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Feb 26 2020, 14:58:38) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-08-15 15:16:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-08-15 15:16:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders']}
2020-08-15 15:16:04 [scrapy.extensions.telnet] INFO: Telnet Password: 0528714e90a1e15d
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['tutorial.spiders.quotes.FirstMiddleware',
 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'tutorial.spiders.quotes.InTheMiddleMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'tutorial.spiders.quotes.LastMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-15 15:16:04 [scrapy.core.engine] INFO: Spider opened
2020-08-15 15:16:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-15 15:16:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-15 15:16:05 [root] INFO: FirstMiddleware - process_spider_output <generator object SpiderMiddlewareManager.scrape_response.<locals>._evaluate_iterable at 0x7f1256bef3c0>
2020-08-15 15:16:05 [root] INFO: InTheMiddleMiddleware - process_spider_output <generator object SpiderMiddlewareManager.scrape_response.<locals>._evaluate_iterable at 0x7f1256bef2e0>
2020-08-15 15:16:05 [root] INFO: LastMiddleware - process_spider_output <generator object SpiderMiddlewareManager.scrape_response.<locals>._evaluate_iterable at 0x7f1256c58970>
2020-08-15 15:16:05 [root] WARNING: LastMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: LastMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/1/> (referer: None)
Traceback (most recent call last):
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/utils/python.py", line 347, in __next__
    return next(self.data)
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/utils/python.py", line 347, in __next__
    return next(self.data)
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 8, in process_spider_output
    yield from result
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 28, in process_spider_output
    yield from result
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 18, in process_spider_output
    yield from result
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 51, in parse
    raise Exception('fake exception')
Exception: fake exception
2020-08-15 15:16:05 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-15 15:16:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 455,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2719,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.386022,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 15, 15, 16, 5, 341508),
 'log_count/ERROR': 1,
 'log_count/INFO': 13,
 'log_count/WARNING': 15,
 'memusage/max': 55001088,
 'memusage/startup': 55001088,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/Exception': 1,
 'start_time': datetime.datetime(2020, 8, 15, 15, 16, 4, 955486)}
2020-08-15 15:16:05 [scrapy.core.engine] INFO: Spider closed (finished)

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

Gallaecio commented on Jan 18, 2021 (2 reactions)

GeorgeA92 commented on Jan 15, 2021 (0 reactions)

@Gallaecio

Actual behaviour with n enabled middlewares that all return None for this exception: mw_n.process_spider_exception -> mw_(n-1).process_spider_exception -> mw_(n-2).process_spider_exception -> … -> mw_1.process_spider_exception -> mw_(n-1).process_spider_exception -> mw_(n-2).process_spider_exception -> … -> mw_1.process_spider_exception -> …

Yes, the spider middleware process_spider_exception chain works as @vulreid described: https://github.com/scrapy/scrapy/blob/26836c4e1ae9588ee173c5977fc6611364ca7cc7/scrapy/core/spidermw.py#L76-L86 This implementation appeared as a result of the code change in #2061, and as far as I understand it aimed to fix issue #220.
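
The linked chain also suggests the way to stop the repetition: exception handling ends as soon as a middleware's process_spider_exception returns an iterable instead of None, handing control back to the process_spider_output chain. A minimal sketch of such a recovering middleware (the class name is hypothetical):

import logging


class RecoveringMiddleware:
    """Sketch of a spider middleware that stops the exception chain by
    returning an iterable from process_spider_exception instead of None."""

    def process_spider_exception(self, response, exception, spider):
        logging.warning('RecoveringMiddleware - recovering from %r', exception)
        # Returning an iterable (here: an empty one, yielding nothing) marks
        # the exception as handled, so the remaining handlers and enclosing
        # wrappers do not re-run the chain.
        return []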
