Multiple execution of process_spider_exception of the same spider middleware method for single exception
Description
I don't know whether this should be considered a bug, but at the very least the behaviour should be the same whether a spider callback only raises an exception or raises an exception after yielding.
When an exception is raised from a spider callback that also yields an item or request, and no middleware catches the exception by returning an iterable,
the sequence of process_spider_exception methods is executed multiple times.
Actual behaviour with n enabled middlewares, all of which return None for the exception:
exec mw_n.process_spider_exception -> mw_(n-1).process_spider_exception -> mw_(n-2).process_spider_exception -> … -> mw_1.process_spider_exception -> mw_(n-1).process_spider_exception -> mw_(n-2).process_spider_exception -> … -> mw_1.process_spider_exception -> …
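The pattern above can be reproduced with a minimal standalone model (hypothetical code, not Scrapy's: the names `evaluate_iterable` and `make_exception_chain` are invented for illustration). Each middleware wraps the callback's generator; when the innermost iteration raises, every wrapping level independently triggers the exception chain from its own position downward before re-raising:

```python
calls = []

def make_exception_chain(start):
    # Models running process_spider_exception for middlewares start..1;
    # all return None, so the exception keeps propagating.
    def chain(exc):
        for i in range(start, 0, -1):
            calls.append(f"mw_{i}")
    return chain

def evaluate_iterable(iterable, exception_chain):
    # Models a per-middleware generator wrapper: if iterating the inner
    # generator raises, run the exception chain, then re-raise.
    try:
        for item in iterable:
            yield item
    except Exception as exc:
        exception_chain(exc)
        raise

def callback():
    raise Exception("fake exception")
    yield {}  # unreachable, but makes the callback a generator

n = 3
result = callback()
for level in range(n, 0, -1):
    # Middleware `level` wraps the result; on failure it triggers the
    # chain starting from its own position.
    result = evaluate_iterable(result, make_exception_chain(level))

try:
    list(result)
except Exception:
    pass

print(calls)
# → ['mw_3', 'mw_2', 'mw_1', 'mw_2', 'mw_1', 'mw_1']
```

With a plain (non-generator) callback there is only one wrapping level that observes the exception, which is why the chain then runs exactly once.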
Steps to Reproduce
- create a clean project
- add the sample spider from the additional section
- run with a spider callback that raises an exception and yields {} (or any request)
- run with a spider callback that only raises an exception
- compare the number of log lines produced by each middleware's process_spider_exception method
Expected behavior: for any implementation of the spider callback, the process_spider_exception sequence is executed once
Actual behavior: the process_spider_exception sequence is executed multiple times if the callback raises an exception and yields an item, and once if it only raises an exception
Reproduces how often: 100%
Versions
Scrapy       : 2.3.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.2 (default, Feb 26 2020, 14:58:38) - [GCC 8.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 3.0
Platform     : Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
Additional context
Possible related to https://github.com/scrapy/scrapy/pull/4272 and https://github.com/scrapy/scrapy/issues/4266
spider to reproduce (clean scrapy install + scrapy startproject tutorial)
import logging

import scrapy


class FirstMiddleware:
    def process_spider_output(self, response, result, spider):
        logging.info(f'FirstMiddleware - process_spider_output {result}')
        yield from result

    def process_spider_exception(self, response, exception, spider):
        logging.warning(f'FirstMiddleware - process_spider_exception {exception}')
        return None


class LastMiddleware:
    def process_spider_output(self, response, result, spider):
        logging.info(f'LastMiddleware - process_spider_output {result}')
        yield from result

    def process_spider_exception(self, response, exception, spider):
        logging.warning(f'LastMiddleware - process_spider_exception {exception}')
        return None


class InTheMiddleMiddleware:
    def process_spider_output(self, response, result, spider):
        logging.info(f'InTheMiddleMiddleware - process_spider_output {result}')
        yield from result

    def process_spider_exception(self, response, exception, spider):
        logging.warning(f'InTheMiddleMiddleware - process_spider_exception {exception}')
        return None


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'SPIDER_MIDDLEWARES': {
            __name__ + '.FirstMiddleware': 10,
            __name__ + '.LastMiddleware': 890,
            __name__ + '.InTheMiddleMiddleware': 400,
        }
    }

    def parse(self, response):
        raise Exception('fake exception')
        yield {}
        # yield scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse)
Output log:
scrapy crawl quotes
2020-08-15 15:16:04 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: tutorial)
2020-08-15 15:16:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Feb 26 2020, 14:58:38) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0
, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-08-15 15:16:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-08-15 15:16:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['tutorial.spiders']}
2020-08-15 15:16:04 [scrapy.extensions.telnet] INFO: Telnet Password: 0528714e90a1e15d
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['tutorial.spiders.quotes.FirstMiddleware',
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'tutorial.spiders.quotes.InTheMiddleMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'tutorial.spiders.quotes.LastMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-15 15:16:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-15 15:16:04 [scrapy.core.engine] INFO: Spider opened
2020-08-15 15:16:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-15 15:16:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-15 15:16:05 [root] INFO: FirstMiddleware - process_spider_output <generator object SpiderMiddlewareManager.scrape_response.<locals>._evaluate_iterable at 0x7f1256bef3c0>
2020-08-15 15:16:05 [root] INFO: InTheMiddleMiddleware - process_spider_output <generator object SpiderMiddlewareManager.scrape_response.<locals>._evaluate_iterable at 0x7f1256bef2e0>
2020-08-15 15:16:05 [root] INFO: LastMiddleware - process_spider_output <generator object SpiderMiddlewareManager.scrape_response.<locals>._evaluate_iterable at 0x7f1256c58970>
2020-08-15 15:16:05 [root] WARNING: LastMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: LastMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: InTheMiddleMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [root] WARNING: FirstMiddleware - process_spider_exception fake exception
2020-08-15 15:16:05 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/1/> (referer: None)
Traceback (most recent call last):
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
yield next(it)
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/utils/python.py", line 347, in __next__
return next(self.data)
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/utils/python.py", line 347, in __next__
return next(self.data)
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 8, in process_spider_output
yield from result
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 28, in process_spider_output
yield from result
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 18, in process_spider_output
yield from result
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/var/www/scrapy_test/.venv/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/var/www/scrapy_test/tutorial/tutorial/spiders/quotes.py", line 51, in parse
raise Exception('fake exception')
Exception: fake exception
2020-08-15 15:16:05 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-15 15:16:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 455,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2719,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.386022,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 15, 15, 16, 5, 341508),
'log_count/ERROR': 1,
'log_count/INFO': 13,
'log_count/WARNING': 15,
'memusage/max': 55001088,
'memusage/startup': 55001088,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/Exception': 1,
'start_time': datetime.datetime(2020, 8, 15, 15, 16, 4, 955486)}
2020-08-15 15:16:05 [scrapy.core.engine] INFO: Spider closed (finished)
Top GitHub Comments
cc @elacuesta
@Gallaecio Yes, the spider middlewares' process_spider_exception chain works as @vulreid described: https://github.com/scrapy/scrapy/blob/26836c4e1ae9588ee173c5977fc6611364ca7cc7/scrapy/core/spidermw.py#L76-L86 This implementation appeared as a result of the code change in #2061, and as far as I understand it was aimed at fixing issue #220.
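The contract of that chain can be sketched with a simplified, hypothetical model (not Scrapy's actual code; `run_exception_chain` and the `mw_*` names are invented for illustration): each middleware's hook is called in turn, a None return passes the exception on, and the first iterable return stops the chain and becomes the new output.

```python
def run_exception_chain(methods, response, exception, spider):
    # Call each process_spider_exception hook in chain order.
    for method in methods:
        result = method(response, exception, spider)
        if result is not None:
            # An iterable stops the chain; downstream middlewares see it
            # as regular spider output instead of the exception.
            return result
    # No middleware handled it; the exception keeps propagating.
    raise exception

handled = []

def mw_a(response, exception, spider):
    handled.append('a')
    return None  # not handled, pass it on

def mw_b(response, exception, spider):
    handled.append('b')
    return iter([{'recovered': True}])  # handled: chain stops here

def mw_c(response, exception, spider):
    handled.append('c')  # never reached
    return None

out = run_exception_chain([mw_a, mw_b, mw_c], None, Exception('boom'), None)
print(list(out), handled)
# → [{'recovered': True}] ['a', 'b']
```

In the reproduction above every middleware returns None, so the chain never stops early; the bug is only that the whole chain is triggered once per wrapping generator rather than once per exception.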