Give `BaseDupeFilter` access to spider-object
I am in a situation where a single item gets defined over a sequence of multiple pages, passing values between the particular callbacks via the meta dict. I believe this is a common approach among Scrapy users.
However, this approach is difficult to get right. With the default implementation of RFPDupeFilter, my callback chain is torn apart quite easily, as fingerprints don’t take the meta dict into account. The corresponding requests are thrown away, and the information in the meta dict that made each request unique is lost.
I have currently implemented my own meta-aware dupefilter, but I am still facing the problem that it lacks access to the specific spider in use - and only the spider really knows which meta attributes make a request unique. I could now take it a step further and implement my own scheduler as well, but I’m afraid that all these custom extensions make my code very brittle with respect to future versions of Scrapy.
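For context, a meta-aware dupefilter along these lines might look like the sketch below. The class name and the meta keys ("item_id", "page") are made up for illustration, which is exactly the problem being described: the dupefilter has to hardcode knowledge that really belongs to the spider.

```python
import hashlib

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class MetaAwareDupeFilter(RFPDupeFilter):
    # Meta keys that make a request unique -- hardcoded here, although only
    # the spider really knows which keys matter.
    META_KEYS = ("item_id", "page")

    def request_fingerprint(self, request):
        base = request_fingerprint(request)
        extra = "|".join(str(request.meta.get(key, "")) for key in self.META_KEYS)
        return hashlib.sha1((base + extra).encode("utf-8")).hexdigest()
```

Such a filter would be enabled with `DUPEFILTER_CLASS = "myproject.dupefilters.MetaAwareDupeFilter"` in settings.py.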
Top GitHub Comments
@IAlwaysBeCoding I’m not sure I follow, sorry! The spider is available to the Scheduler in the from_crawler method, which is called when the Scheduler is created; a Scheduler subclass can store the spider as an attribute. But this doesn’t solve @Chratho’s problem, as he needs the spider in the dupefilter, not necessarily in the Scheduler.
Adding a ‘spider’ argument to next_request is backwards incompatible and could break existing custom schedulers. It also doesn’t provide any new features, as this argument would be unused by default, and a Scheduler subclass can already access the spider by overriding the from_crawler method (as a bonus, access is not limited to the next_request method if the spider is saved in from_crawler).
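As a rough sketch of that last point (not a recommendation), a Scheduler subclass could keep a reference to the crawler in from_crawler and reach the spider from there, without changing any method signatures; the class name here is invented:

```python
from scrapy.core.scheduler import Scheduler


class SpiderAwareScheduler(Scheduler):
    @classmethod
    def from_crawler(cls, crawler):
        scheduler = super().from_crawler(crawler)
        # Keep the crawler around; crawler.spider points at the running spider.
        scheduler.crawler = crawler
        return scheduler

    def next_request(self):
        spider = getattr(self.crawler, "spider", None)
        # ... per-spider logic could go here ...
        return super().next_request()
```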
Another way to solve the problem is to pass a list of meta keys that should be considered for the uniqueness check; a custom dupefilter may look at this list and compute the fingerprint accordingly. See also: https://github.com/scrapy/scrapy/issues/900
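One possible shape of that idea, as a sketch: the spider declares, per request, which meta keys matter (the "dupe_keys" meta key name is invented here), and a custom dupefilter folds the corresponding values into the fingerprint.

```python
import hashlib

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class DeclaredKeysDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        base = request_fingerprint(request)
        # The request itself carries the names of the meta keys that make it unique.
        keys = request.meta.get("dupe_keys", ())
        extra = "|".join(f"{key}={request.meta.get(key)}" for key in sorted(keys))
        return hashlib.sha1((base + extra).encode("utf-8")).hexdigest()


# In a spider callback, the request would state what makes it unique:
# yield Request(url, callback=self.parse_detail,
#               meta={"item_id": item_id, "dupe_keys": ["item_id"]})
```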
If we’re to make the spider available to the dupefilter, we should add from_crawler support to dupefilters in addition to from_settings; this is how it is solved for all other Scrapy components.
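For illustration, a dupefilter with from_crawler support could look roughly like the sketch below. This assumes the scheduler actually instantiates the dupefilter through from_crawler when it is present (the very change proposed here), and the spider attribute name "unique_meta_keys" is hypothetical.

```python
import hashlib

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.job import job_dir
from scrapy.utils.request import request_fingerprint


class SpiderAwareDupeFilter(RFPDupeFilter):
    def __init__(self, path=None, debug=False, crawler=None):
        super().__init__(path, debug)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(job_dir(settings), settings.getbool("DUPEFILTER_DEBUG"),
                   crawler=crawler)

    def request_fingerprint(self, request):
        base = request_fingerprint(request)
        # Only the spider knows which meta keys make a request unique;
        # "unique_meta_keys" is a hypothetical spider attribute.
        keys = getattr(self.crawler.spider, "unique_meta_keys", ())
        extra = "|".join(str(request.meta.get(key, "")) for key in sorted(keys))
        return hashlib.sha1((base + extra).encode("utf-8")).hexdigest()
```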