
Ability to retry a request from inside a spider callback


There are situations where websites return 200 responses but the content is not available due to bans or temporary issues that can be fixed by retrying the request.

There should be an easier way to retry requests inside spider callbacks, ideally reusing the code in the Retry downloader middleware.

I see two approaches for this.

  1. Introduce a new exception called RetryRequest which can be raised inside a spider callback to indicate a retry. I personally prefer this, but the implementation is a little untidy due to bug #220. (A rough sketch of how this could be wired up as a spider middleware follows below, after the list.)

    from scrapy.exceptions import RetryRequest
    
    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            raise RetryRequest('Missing content')
    
  2. Introduce a new class RetryRequest which wraps a request that needs to be retried. A RetryRequest can be yielded from a spider callback to indicate a retry.

    from scrapy.http import RetryRequest
    
    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            yield RetryRequest(response.request, reason='Missing content')
    

I will be sending two PRs, one for each approach. Happy to hear about any other alternatives too.
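
Neither snippet above does anything by itself: something in the middleware chain has to catch the retry signal, count attempts, and re-schedule a fresh copy of the request. As a rough illustration of what approach 1 implies, here is a minimal sketch of a spider middleware that translates the exception into a retried request; RetryRequest and RetrySpiderMiddleware are assumed names for this sketch, not an existing Scrapy API.

    # Hypothetical sketch only; a real implementation would reuse RetryMiddleware's
    # settings, stats and logging rather than hard-coding them here.
    class RetryRequest(Exception):
        """Raised from a callback to ask for the response's request to be retried."""

    class RetrySpiderMiddleware:
        MAX_RETRIES = 2

        def process_spider_exception(self, response, exception, spider):
            if not isinstance(exception, RetryRequest):
                return None  # not ours: let the default exception handling run
            retries = response.meta.get('retry_times', 0) + 1
            if retries > self.MAX_RETRIES:
                spider.logger.warning('Gave up retrying %s', response.url)
                return []
            spider.logger.info('Retrying %s (attempt %d)', response.url, retries)
            return [response.request.replace(
                dont_filter=True,
                meta={**response.meta, 'retry_times': retries},
            )]

For what it is worth, later Scrapy releases (2.5+) did end up shipping a get_retry_request() helper in scrapy.downloadermiddlewares.retry that covers this use case from inside callbacks.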

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 7
  • Comments: 20 (12 by maintainers)

Top GitHub Comments

13 reactions
kasun commented, Jan 21, 2019

@Gallaecio I miss everything the retry middleware currently does.

You can’t do this infinitely: you need to count the retries and give up after a certain number of attempts, and you need stats and log messages.

IMO there should be an easier way to do the above than re-implementing the same code in every project/spider.
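
To make the point concrete, this is roughly the boilerplate a spider would otherwise have to hand-roll in every callback to get retry counting, a give-up limit, stats and log messages (a hypothetical example for illustration, not code from the issue; it reuses the standard RETRY_TIMES setting and the retry_times meta key):

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            retries = response.meta.get('retry_times', 0)
            max_retries = self.settings.getint('RETRY_TIMES', 2)
            if retries < max_retries:
                self.logger.info('Retrying %s (attempt %d/%d)',
                                 response.url, retries + 1, max_retries)
                self.crawler.stats.inc_value('retry/count')
                yield response.request.replace(
                    meta={**response.meta, 'retry_times': retries + 1},
                    dont_filter=True,
                )
            else:
                self.logger.warning('Gave up retrying %s after %d attempts',
                                    response.url, retries)
                self.crawler.stats.inc_value('retry/max_reached')
            return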

3 reactions
GeorgeA92 commented, Feb 21, 2019

@ejulio I agree. There are 2 non-private methods from which the _retry method is called.

process_response calls _retry if the response code is listed in retry_http_codes (from the RETRY_HTTP_CODES setting). This is not our case, because a 200 response is not in retry_http_codes and the response needs to pass through all middlewares (including RetryMiddleware) in order to reach the spider callback (and be retried from there).
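
For reference, these are the retry-related defaults in recent Scrapy versions (the exact list of codes has changed across releases); the relevant point is simply that 200 is not among them:

    # Illustrative defaults; exact values vary by Scrapy version.
    RETRY_ENABLED = True
    RETRY_TIMES = 2  # retries on top of the first attempt
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # no 200 here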

process_exception calls _retry if the exception argument is listed in EXCEPTIONS_TO_RETRY, which is stored as a tuple: https://github.com/scrapy/scrapy/blob/c72ab1d4ba5dad3c68b12c473fa55b7f1f144834/scrapy/downloadermiddlewares/retry.py#L34-L37

We can use the non-private process_exception method instead of _retry in spider code after the following steps:

  1. define a new exception;
  2. add our custom exception to EXCEPTIONS_TO_RETRY on the RetryMiddleware object;
  3. call the process_exception method with our new exception as the argument.

The code is slightly different:

import scrapy


class ContentNotFoundException(Exception):
    """
    Raised when the page returns 200 but the expected content is missing.
    """


class MySpider(scrapy.Spider):
    ...

    def start_requests(self):
        # Find the RetryMiddleware instance among the enabled downloader middlewares
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.RetryMiddleware = [middleware for middleware in downloader_middlewares
                                if "RetryMiddleware" in str(type(middleware))][0]
        # Register our custom exception so the middleware is willing to retry on it
        self.RetryMiddleware.EXCEPTIONS_TO_RETRY = tuple(
            list(self.RetryMiddleware.EXCEPTIONS_TO_RETRY) + [ContentNotFoundException])
        ...

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            # Let RetryMiddleware build the retried request (and track stats/logging)
            yield self.RetryMiddleware.process_exception(
                response.request, ContentNotFoundException(), self)
            return
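
One caveat with the snippet above: _retry (and therefore process_exception) returns None once the maximum number of retries has been exhausted, so yielding its result unconditionally can emit None from the callback. A slightly safer variant, assuming the same setup, only yields when a request actually comes back:

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            retried = self.RetryMiddleware.process_exception(
                response.request, ContentNotFoundException(), self)
            if retried is not None:  # None means the middleware gave up
                yield retried
            return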

