
make offsite middleware downloader middleware instead of spider middleware

See original GitHub issue

Currently the offsite middleware is a spider middleware. It works only on spider output, i.e. it only processes requests generated in spider callbacks. But requests in Scrapy can also be scheduled outside spider callbacks. This can happen in the following cases:

  • requests scheduled in start_requests
  • requests scheduled with crawler.engine.crawl(), crawler.engine.download(), crawler.engine.schedule

The second case (scheduling with crawler.engine.crawl()) is very common in my experience. Many spiders get lists of URLs from some external source and schedule requests on the spider_idle signal. The offsite middleware won’t work properly for them; the allowed_domains attribute will have no effect.
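
To make this failure mode concrete, here is a minimal sketch of the spider_idle pattern described above. fetch_next_url_from_external_source is a hypothetical placeholder, and the two-argument crawler.engine.crawl() call is the historical signature (newer Scrapy versions take only the request):

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class FeedSpider(scrapy.Spider):
    name = "feed"
    allowed_domains = ["example.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        # Hypothetical external source of URLs (e.g. a queue or database).
        url = self.fetch_next_url_from_external_source()
        if url:
            # Requests scheduled here bypass the spider middleware chain,
            # so the offsite middleware never sees them: even a URL outside
            # allowed_domains would be downloaded.
            self.crawler.engine.crawl(scrapy.Request(url), self)
            raise DontCloseSpider
```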

I think we should make the offsite middleware a downloader middleware and make it work for all requests, or at least discuss and clarify why we want to keep it as a spider middleware and document that it does not apply to all possible requests.

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
petervandenabeele commented, Jan 26, 2020

Newcomer here 😃

Following up on my initial (naive) comments in https://github.com/scrapy/scrapy/issues/3217

Sorry if I am vague and not adding specific code or a PR (yet), but having read some more comments, and having hit this problem as a very unexpected side effect of requests landing on URLs outside of my strict allowed_domains, it seems to me that a possible solution could be:

We do NOT change the behavior of the current:

  • scrapy.spidermiddlewares.offsite.OffsiteMiddleware:
    FOR PERFORMANCE: it will stop processing outgoing links that are offsite as early as possible
  • scrapy.downloadermiddlewares.redirect.RedirectMiddleware:
    FOR COMPLETENESS: it will follow redirects (but in 1% of cases, these may actually be offsite, which is the problem in user expectation that we are discussing here)
  • the current list and order of default middlewares:
    FOR BACKWARDS COMPATIBILITY

We DO add:

  • a new scrapy.downloadermiddlewares.offsite.OffsiteMiddleware:
    FOR CORRECTNESS: when activated, this will guarantee that we strictly never hit a URL outside of our list of allowed_domains.
  • a note in the documentation:
    mentioning the historic “problem” and how to optionally add this middleware near the bottom of the downloader middlewares stack to avoid it (a sketch of such an activation follows this list).
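
As an illustration of what that documentation note might show, activation could look roughly like this. The class path myproject.middlewares.OffsiteDownloaderMiddleware is hypothetical (sketched further below), and 900 is just a large order value placing it near the downloader end of the stack:

```python
# settings.py -- hypothetical opt-in activation of a downloader-side
# offsite middleware, ordered after RedirectMiddleware (default order 600)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.OffsiteDownloaderMiddleware": 900,
}
```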

Implementation:

  • implementation-wise, there will probably be a lot of duplicated functionality between the existing scrapy.spidermiddlewares.offsite.OffsiteMiddleware and the new scrapy.downloadermiddlewares.offsite.OffsiteMiddleware. But it should be a solvable problem to implement this in a DRY fashion (duplicated functionality, but using the same code).
    Naively … could the scrapy.downloadermiddlewares.offsite.OffsiteMiddleware “simply” be a “shim” that calls scrapy.spidermiddlewares.offsite.OffsiteMiddleware with the correct arguments? (See the sketch below.)
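
A minimal sketch of such a shim, assuming it is acceptable to inherit from the spider middleware and that its internal hooks (should_follow, the spider_opened handler that builds the host regex) stay as they are today; the class and stats-key names are hypothetical:

```python
from scrapy.exceptions import IgnoreRequest
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware as SpiderOffsiteMiddleware

class OffsiteDownloaderMiddleware(SpiderOffsiteMiddleware):
    """Hypothetical shim: reuse the spider middleware's allowed_domains
    host-regex logic, but enforce it from the downloader side."""

    def process_request(self, request, spider):
        # Mirror the spider middleware's escape hatch for dont_filter.
        if request.dont_filter or self.should_follow(request, spider):
            return None  # onsite (or exempt): let the request proceed
        self.stats.inc_value("offsite/downloader_filtered", spider=spider)
        raise IgnoreRequest(f"offsite request dropped: {request.url}")
```

Because should_follow() and the spider_opened hook are inherited unchanged, both middlewares would share a single implementation of the actual domain check.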

Longer term:

  • If and when the behavior is stable, we could also add this middleware to the default list of downloader middlewares, gated by a flag or setting “STRICT_OFFSITE_FILTERING” which, when turned on, would enable the strict behavior (see the sketch after this list).
  • The binary (non-backwards-compatible) decision in the long run would then be to propose that “STRICT_OFFSITE_FILTERING” become True by default.
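
A sketch of how that flag could gate the middleware, building on the hypothetical OffsiteDownloaderMiddleware above; the setting name comes from the comment itself and is not an existing Scrapy setting:

```python
from scrapy.exceptions import NotConfigured

class StrictOffsiteMiddleware(OffsiteDownloaderMiddleware):
    @classmethod
    def from_crawler(cls, crawler):
        # Dormant unless explicitly enabled, so defaults stay backwards compatible.
        if not crawler.settings.getbool("STRICT_OFFSITE_FILTERING", False):
            raise NotConfigured
        return super().from_crawler(crawler)
```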

Design:

I do think we need the offsite filtering in two places:

  • “early” in the spider, to drop that unneeded work ASAP (for performance)
  • “early” in the downloader, to be 100% sure that we never ever hit URIs outside of the allowed_domains (for correctness)

Technically, this could also be achieved with an additional check in the RedirectMiddleware, but then we would be changing existing functionality of that middleware …
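
For completeness, a rough sketch of that (rejected) alternative; OffsiteAwareRedirectMiddleware is a hypothetical name, while url_is_from_any_domain is an existing Scrapy utility:

```python
import scrapy
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.url import url_is_from_any_domain

class OffsiteAwareRedirectMiddleware(RedirectMiddleware):
    """Hypothetical variant that refuses to follow offsite redirects."""

    def process_response(self, request, response, spider):
        result = super().process_response(request, response, spider)
        allowed = getattr(spider, "allowed_domains", None)
        if (
            isinstance(result, scrapy.Request)  # a redirect to follow
            and allowed
            and not url_is_from_any_domain(result.url, allowed)
        ):
            raise IgnoreRequest(f"offsite redirect dropped: {result.url}")
        return result
```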

HTH …

2 reactions
Granitosaurus commented, Sep 15, 2016

I agree. The offsite middleware doesn’t seem to benefit from being a spider middleware at all. It doesn’t need access to the response; it only needs to check each generated request’s url attribute.

My only concern is that making this a downloader middleware will pollute the scheduler with obviously unwanted requests. This could affect lazy broad-crawl spiders that just yield every URL on a page and let the offsite middleware deal with it.
Does anyone or anything actually care about the scheduler? AFAIK all of the limits and other settings are resolved at the downloader, right? (Except for the depth limit, which is a spider middleware.)

Alternative solutions:

  • Have both spider and downloader offsite middlewares, though that would be really ugly redundancy.
  • Make the offsite middleware easily callable for the cases where people use crawler.engine.crawl(), because right now doing the check by hand is a bit tedious (see the sketch below).
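
To illustrate the “tedious” status quo, manually reusing the spider middleware’s check before scheduling might look roughly like this; crawl_if_onsite is a hypothetical helper, the two-argument engine.crawl() is the historical signature, and the middleware’s internal hooks may differ across Scrapy versions:

```python
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

def crawl_if_onsite(crawler, spider, request):
    # Instantiate the spider middleware by hand just to borrow its check.
    mw = OffsiteMiddleware.from_crawler(crawler)
    mw.spider_opened(spider)  # builds the host regex from allowed_domains
    if mw.should_follow(request, spider):
        crawler.engine.crawl(request, spider)
```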

Personally, I don’t like either of the alternative solutions, and if my concern about the scheduler is not significant, I would gladly see the offsite spider middleware become a downloader middleware instead.
