Ability to retry a request from inside a spider callback
There are situations where websites return 200 responses but the content is not available due to bans or temporary issues that can be fixed by retrying the request.
There should be an easier way to retry requests inside spider callbacks, ideally reusing the code in the Retry downloader middleware.
I see two approaches for this.
- Introduce a new exception called `RetryRequest` which can be raised inside a spider callback to indicate a retry. I personally prefer this, but the implementation is a little untidy due to bug #220:

  ```python
  from scrapy.exceptions import RetryRequest

  def parse(self, response):
      if response.xpath('//title[text()="Content not found"]'):
          raise RetryRequest('Missing content')
  ```
- Introduce a new class `RetryRequest` which wraps a request that needs to be retried. A `RetryRequest` can be yielded from a spider callback to indicate a retry:

  ```python
  from scrapy.http import RetryRequest

  def parse(self, response):
      if response.xpath('//title[text()="Content not found"]'):
          yield RetryRequest(response.request, reason='Missing content')
  ```
Will be sending two PRs for the two approaches. Happy to hear about any other alternatives too.
Issue Analytics

- State:
- Created 5 years ago
- Reactions: 7
- Comments: 20 (12 by maintainers)
Top GitHub Comments
@Gallaecio I would miss everything the retry middleware currently does. You can't retry infinitely: you need to count the retries and give up after a certain number of attempts, and you need stats and log messages. IMO there should be an easier way to do these things than re-implementing the same code in every project/spider.
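To illustrate the bookkeeping being referred to, here is a minimal plain-Python sketch (not Scrapy's API): the `retry_times` key mirrors the meta key the retry middleware uses, and `max_retries` is a hypothetical stand-in for the `RETRY_TIMES` setting.

```python
# Hypothetical sketch of retry bookkeeping a spider callback would
# otherwise have to reimplement. `meta` stands in for request.meta.

def should_retry(meta, max_retries=2):
    """Increment the retry counter stored in the request meta and
    report whether another attempt is still allowed."""
    retries = meta.get('retry_times', 0) + 1
    if retries <= max_retries:
        meta['retry_times'] = retries
        return True
    return False
```

On top of this counter, the real middleware also records stats and emits log messages, which is exactly the duplication the comment is complaining about.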
@ejulio I agree. There are 2 non-private methods from which the `_retry` method is called:

- `process_response` calls `_retry` if the response code is listed in `retry_http_codes` (from the `RETRY_HTTP_CODES` setting). This is not our case, because a 200 response is not in `retry_http_codes`, and the response needs to pass all middlewares (including `RetryMiddleware`) in order to reach the spider callback (and retry the request from the spider callback).
- `process_exception` calls `_retry` if the exception argument is listed in `EXCEPTIONS_TO_RETRY`, which is stored as a tuple. https://github.com/scrapy/scrapy/blob/c72ab1d4ba5dad3c68b12c473fa55b7f1f144834/scrapy/downloadermiddlewares/retry.py#L34-L37

We can use the non-private `process_exception` method instead of `_retry` in spider code after the following:

1. define a new `Exception`
2. add our custom `Exception` to `EXCEPTIONS_TO_RETRY` in the `RetryMiddleware` object
3. call the `process_exception` method with our new exception as argument

Code is slightly different:
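The comment's code sample did not survive extraction; the three steps above can be sketched as follows. This is a hedged, self-contained illustration: the real `RetryMiddleware` needs crawler settings to instantiate, so a stand-in class (and a plain dict in place of a `Request`) is used here, though the attribute and method names (`EXCEPTIONS_TO_RETRY`, `process_exception`) match the real middleware's interface.

```python
# Runnable sketch of the three steps above, using a stand-in for
# scrapy.downloadermiddlewares.retry.RetryMiddleware.

class ContentNotFoundError(Exception):
    """Step 1: define a new exception for the 'soft' failure."""

class StandInRetryMiddleware:
    """Mimics the part of RetryMiddleware's interface used here."""
    EXCEPTIONS_TO_RETRY = (TimeoutError, ConnectionError)

    def process_exception(self, request, exception, spider):
        # Retry only exceptions explicitly listed in EXCEPTIONS_TO_RETRY.
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY):
            retried = dict(request)  # copy, as _retry copies the request
            retried['retry_times'] = request.get('retry_times', 0) + 1
            return retried
        return None  # let other middlewares / errbacks handle it

middleware = StandInRetryMiddleware()

# Step 2: add the custom exception to EXCEPTIONS_TO_RETRY on the object.
middleware.EXCEPTIONS_TO_RETRY = middleware.EXCEPTIONS_TO_RETRY + (ContentNotFoundError,)

# Step 3: call process_exception with the new exception as argument.
request = {'url': 'http://example.com/page'}
retry = middleware.process_exception(request, ContentNotFoundError(), spider=None)
```

With real Scrapy, step 2 would be done on the `RetryMiddleware` instance obtained from the middleware manager, and `retry` would be a rescheduled `Request` rather than a dict.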