Make offsite middleware a downloader middleware instead of a spider middleware
Currently the offsite middleware is a spider middleware. It works only on spider output, i.e. it only processes requests generated in spider callbacks. But requests in Scrapy can be scheduled outside spider callbacks. This can happen in the following cases:
- requests scheduled in `start_requests`
- requests scheduled with `crawler.engine.crawl()`, `crawler.engine.download()` or `crawler.engine.schedule()`
The second case (scheduling with `crawler.engine.crawl()`) is very common in my experience. Many spiders get lists of URLs from some external source and schedule requests on the `spider_idle` signal. The offsite middleware won't work properly for them; the `allowed_domains` attribute will have no effect.
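To make the scenario concrete, here is a minimal sketch of that pattern, assuming the engine API as it existed when this issue was filed (`crawler.engine.crawl(request, spider)`); `fetch_urls_from_external_source()` is a hypothetical stand-in for whatever external source the spider reads from. Because these requests go straight to the engine instead of being yielded from a callback, the offsite spider middleware never sees them and `allowed_domains` is not enforced:

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


def fetch_urls_from_external_source():
    # Hypothetical stand-in for an external feed, queue, database, etc.
    return ["https://other-domain.example/page"]  # not in allowed_domains


class FeedDrivenSpider(scrapy.Spider):
    name = "feed_driven"
    allowed_domains = ["example.com"]  # has no effect on the requests below

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        urls = fetch_urls_from_external_source()
        for url in urls:
            # Scheduled directly on the engine, not yielded from a callback,
            # so OffsiteMiddleware (a spider middleware) is never invoked.
            self.crawler.engine.crawl(scrapy.Request(url), spider)
        if urls:
            raise DontCloseSpider  # keep the spider alive for the new requests

    def parse(self, response):
        self.logger.info("Got %s", response.url)
```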
I think we should make the offsite middleware a downloader middleware and make it work for all requests, or at least discuss and clarify why we want to keep it as a spider middleware and document that it does not work on all possible requests.
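For reference, a downloader-side offsite filter would not be much code. The following is only a sketch of the idea, not an existing Scrapy component; it reuses Scrapy's `url_is_from_any_domain()` helper and drops offsite requests before they reach the downloader:

```python
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.url import url_is_from_any_domain


class OffsiteDownloaderMiddleware:
    """Sketch: enforce allowed_domains for every request that reaches
    the downloader, regardless of how it was scheduled."""

    def process_request(self, request, spider):
        domains = getattr(spider, "allowed_domains", None) or []
        if domains and not url_is_from_any_domain(request.url, domains):
            raise IgnoreRequest(f"Offsite request dropped: {request.url}")
        return None  # onsite (or no allowed_domains): continue normally
```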
Newcomer here 😃
Following up on my initial (naive) comments in https://github.com/scrapy/scrapy/issues/3217
Sorry if I’m vague and not adding specific code or a PR (yet), but having read some more comments, and having hit the problem as a “very” unexpected side effect of hitting URLs outside of my strict `allowed_domains`, it seems to me that a possible solution could be:
We do NOT change the behavior of the current:

- `scrapy.spidermiddlewares.offsite.OffsiteMiddleware`: FOR PERFORMANCE: it stops processing outgoing links that are offsite as early as possible.
- `scrapy.downloadermiddlewares.redirect.RedirectMiddleware`: FOR COMPLETENESS: it follows redirects (but in 1% of cases these may actually be offsite, which is the problem in user expectation that we are discussing here).
- FOR BACKWARDS COMPATIBILITY.

We DO add:

- `scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`: FOR CORRECTNESS: when activated, this will guarantee that we strictly never hit a URL outside of our list of `allowed_domains`.
- documentation mentioning the historic “problem” and how to optionally add this middleware near the bottom of the downloader middlewares stack to avoid it (see the settings sketch after this list).
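For illustration, enabling such a middleware could look like this in `settings.py`; the class path is the hypothetical one proposed above, and 900 is just an example order value placing it near the bottom of the downloader middleware stack:

```python
# settings.py (sketch): the class path below is the proposed, hypothetical
# one; 900 is an example order placing the middleware late in the stack.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 900,
}
```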
Implementation:

There would be duplicated functionality between the existing `scrapy.spidermiddlewares.offsite.OffsiteMiddleware` and the new `scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`. But that should be a solvable problem: implement this in a DRY fashion (duplicate functionality, but using the same code). Naively … could the `scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` “simply” be a “shim” that calls `scrapy.spidermiddlewares.offsite.OffsiteMiddleware` with the correct arguments?
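A hedged sketch of that shim idea, assuming the spider middleware's `should_follow()` method and its `spider_opened()` setup (both present in its implementation at the time) can simply be inherited by a downloader middleware; the class name is made up for illustration:

```python
from scrapy.exceptions import IgnoreRequest
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware as SpiderOffsiteMiddleware


class OffsiteDownloaderShim(SpiderOffsiteMiddleware):
    """Reuses the spider middleware's domain logic (host regex built in
    spider_opened(), checked by should_follow()) from a downloader hook."""

    def process_request(self, request, spider):
        # dont_filter keeps parity with the spider middleware's escape hatch.
        if request.dont_filter or self.should_follow(request, spider):
            return None  # onsite: let the request proceed to the downloader
        raise IgnoreRequest(f"Offsite request dropped: {request.url}")
```

The inherited `from_crawler()` already connects the `spider_opened` signal, so the host regex would be built the same way in both middlewares.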
Longer term:

Design: I do think we need the offsite filtering in 2 places (the spider middleware for performance, the downloader middleware for correctness, as above). Technically, this could also be achieved with an additional check in the RedirectMiddleware, but then we would be changing existing functionality of that middleware …
HTH …
I agree. The offsite middleware doesn’t seem to benefit from being a spider middleware at all. It doesn’t need access to the `response`; it only needs to check the `url` attribute of generated requests.

My only concern is that making this a downloader middleware will pollute the scheduler with obviously unwanted requests. This could probably affect lazy broad-crawl spiders that just yield every URL on a page and let the offsite middleware deal with it.
Does anyone/anything care about the scheduler? AFAIK all of the limits and other settings are actually resolved on the downloader, right? (except for the depth limit, which is a spider middleware)
Alternative solutions:

- […] `crawler.engine.crawl()`, because right now it’s a bit tedious.

Personally, I don’t like any of the alternative solutions, and if my raised concern regarding the scheduler is not significant, I would gladly have the offsite spider middleware become a downloader middleware instead.