
Ability to retry a request from inside a spider callback


There are situations where websites return 200 responses but the content is not available due to bans or temporary issues that can be fixed by retrying the request.

There should be an easier way to retry requests inside spider callbacks, ideally reusing the code in the Retry downloader middleware.

I see two approaches for this.

  1. Introduce a new exception called RetryRequest which can be raised inside a spider callback to indicate a retry. I personally prefer this, but the implementation is a little untidy due to bug #220. (A rough sketch of how this could be wired up as a spider middleware follows below, after the list.)

    from scrapy.exceptions import RetryRequest
    
    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            raise RetryRequest('Missing content')
    
  2. Introduce a new class RetryRequest which wraps a request that needs to be retried. A RetryRequest can be yielded from a spider callback to indicate a retry.

    from scrapy.http import RetryRequest
    
    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            yield RetryRequest(response.request, reason='Missing content')
    

I will be sending two PRs, one for each approach. Happy to hear about any other alternatives too.
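
Neither snippet above does anything by itself: something in the middleware chain has to catch the retry signal, count attempts, and re-schedule a fresh copy of the request. As a rough illustration of what approach 1 implies, here is a minimal sketch of a spider middleware that translates the exception into a retried request; RetryRequest and RetrySpiderMiddleware are assumed names for this sketch, not an existing Scrapy API.

    # Hypothetical sketch only; a real implementation would reuse RetryMiddleware's
    # settings, stats and logging rather than hard-coding them here.
    class RetryRequest(Exception):
        """Raised from a callback to ask for the response's request to be retried."""

    class RetrySpiderMiddleware:
        MAX_RETRIES = 2

        def process_spider_exception(self, response, exception, spider):
            if not isinstance(exception, RetryRequest):
                return None  # not ours: let the default exception handling run
            retries = response.meta.get('retry_times', 0) + 1
            if retries > self.MAX_RETRIES:
                spider.logger.warning('Gave up retrying %s', response.url)
                return []
            spider.logger.info('Retrying %s (attempt %d)', response.url, retries)
            return [response.request.replace(
                dont_filter=True,
                meta={**response.meta, 'retry_times': retries},
            )]

For what it is worth, later Scrapy releases (2.5+) did end up shipping a get_retry_request() helper in scrapy.downloadermiddlewares.retry that covers this use case from inside callbacks.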

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 7
  • Comments: 20 (12 by maintainers)

Top GitHub Comments

13 reactions
kasun commented, Jan 21, 2019

@Gallaecio I miss everything the retry middleware currently does.

You can’t do this infinitely: you need to count the retries and give up after a certain number of attempts, and you need stats and log messages.

IMO there should be an easier way to do the above than re-implementing the same code in every project/spider.
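
To make the point concrete, this is roughly the boilerplate a spider would otherwise have to hand-roll in every callback to get retry counting, a give-up limit, stats and log messages (a hypothetical example for illustration, not code from the issue; it reuses the standard RETRY_TIMES setting and the retry_times meta key):

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            retries = response.meta.get('retry_times', 0)
            max_retries = self.settings.getint('RETRY_TIMES', 2)
            if retries < max_retries:
                self.logger.info('Retrying %s (attempt %d/%d)',
                                 response.url, retries + 1, max_retries)
                self.crawler.stats.inc_value('retry/count')
                yield response.request.replace(
                    meta={**response.meta, 'retry_times': retries + 1},
                    dont_filter=True,
                )
            else:
                self.logger.warning('Gave up retrying %s after %d attempts',
                                    response.url, retries)
                self.crawler.stats.inc_value('retry/max_reached')
            return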

3 reactions
GeorgeA92 commented, Feb 21, 2019

@ejulio I agree. There are 2 non-private methods from which the _retry method is called.

process_response calls _retry if the response code is listed in retry_http_codes (from the RETRY_HTTP_CODES setting). This is not our case, because a 200 response is not in retry_http_codes and the response needs to pass through all middlewares (including RetryMiddleware) in order to reach the spider callback (and be retried from there).
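
For reference, these are the retry-related defaults in recent Scrapy versions (the exact list of codes has changed across releases); the relevant point is simply that 200 is not among them:

    # Illustrative defaults; exact values vary by Scrapy version.
    RETRY_ENABLED = True
    RETRY_TIMES = 2  # retries on top of the first attempt
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # no 200 here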

process_exception calls _retry if the exception argument is listed in EXCEPTIONS_TO_RETRY, which is stored as a tuple: https://github.com/scrapy/scrapy/blob/c72ab1d4ba5dad3c68b12c473fa55b7f1f144834/scrapy/downloadermiddlewares/retry.py#L34-L37

We can use the non-private process_exception method instead of _retry in spider code after the following steps:

  1. define a new exception;
  2. add our custom exception to EXCEPTIONS_TO_RETRY on the RetryMiddleware object;
  3. call the process_exception method with our new exception as the argument.

The code is slightly different:

import scrapy


class ContentNotFoundException(Exception):
    """
    Raised when the page returns 200 but the expected content is missing.
    """


class MySpider(scrapy.Spider):
    ...

    def start_requests(self):
        # Find the RetryMiddleware instance among the enabled downloader middlewares
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.RetryMiddleware = [middleware for middleware in downloader_middlewares
                                if "RetryMiddleware" in str(type(middleware))][0]
        # Register our custom exception so the middleware is willing to retry on it
        self.RetryMiddleware.EXCEPTIONS_TO_RETRY = tuple(
            list(self.RetryMiddleware.EXCEPTIONS_TO_RETRY) + [ContentNotFoundException])
        ...

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            # Let RetryMiddleware build the retried request (and track stats/logging)
            yield self.RetryMiddleware.process_exception(
                response.request, ContentNotFoundException(), self)
            return
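
One caveat with the snippet above: _retry (and therefore process_exception) returns None once the maximum number of retries has been exhausted, so yielding its result unconditionally can emit None from the callback. A slightly safer variant, assuming the same setup, only yields when a request actually comes back:

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            retried = self.RetryMiddleware.process_exception(
                response.request, ContentNotFoundException(), self)
            if retried is not None:  # None means the middleware gave up
                yield retried
            return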

