
Give `BaseDupeFilter` access to spider-object

See original GitHub issue

I am in a situation where a single item is built up over a sequence of multiple pages, with values passed between the individual callbacks using the `meta` dict. I believe this is a common approach among Scrapy users.
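
A minimal, hypothetical spider illustrating the pattern (spider name, URLs, and selectors are made up for this sketch):

```python
import scrapy

class MultiPageItemSpider(scrapy.Spider):
    """One item is assembled across several pages; partial values
    travel between callbacks in the meta dict."""
    name = 'multipage'
    start_urls = ['http://example.com/listing']

    def parse(self, response):
        for href in response.css('a.detail::attr(href)').getall():
            # Begin the item here and hand it to the next callback via meta.
            yield response.follow(
                href,
                callback=self.parse_detail,
                meta={'item': {'listing_url': response.url}},
            )

    def parse_detail(self, response):
        item = response.meta['item']
        item['title'] = response.css('h1::text').get()
        yield item
```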

However, this approach is difficult to get right. With the default implementation of `RFPDupeFilter`, my callback chain is torn apart quite easily, as fingerprints don't take the `meta` dict into account. The corresponding requests are thrown away, and the information in the `meta` dict that made them unique is lost.
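
The behaviour is easy to demonstrate: the default fingerprint in Scrapy 1.x/2.x (`scrapy.utils.request.request_fingerprint`) hashes the method, URL, body, and optionally headers, but never `meta`, so two requests that differ only in `meta` collide:

```python
from scrapy import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://example.com/step2', meta={'item': {'color': 'red'}})
r2 = Request('http://example.com/step2', meta={'item': {'color': 'blue'}})

# Same fingerprint: meta plays no part in it, so the second request
# is silently discarded as a duplicate by RFPDupeFilter.
assert request_fingerprint(r1) == request_fingerprint(r2)
```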

I have currently implemented my own meta-aware dupefilter, but I am still facing the problem that it lacks access to the specific spider in use, and only the spider really knows which `meta` attributes make a request unique. I could now take it a step further and implement my own scheduler as well, but I'm afraid that all these custom extensions make my code very brittle with respect to future versions of Scrapy.
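
A sketch of such a meta-aware dupefilter (not the author's actual code) shows why spider access matters: built via `from_settings`, the filter never sees the spider, so it cannot ask which `meta` keys are significant and can only hash `meta` indiscriminately:

```python
import hashlib

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class MetaAwareDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        base = request_fingerprint(request)
        h = hashlib.sha1(base.encode())
        # Without spider access, there is no way to know which meta keys
        # matter, so everything (including Scrapy-internal keys such as
        # 'depth') ends up in the hash: too aggressive in practice.
        h.update(repr(sorted(request.meta.items())).encode())
        return h.hexdigest()
```

Enabling it via `DUPEFILTER_CLASS` is enough, since `RFPDupeFilter.request_seen` delegates to `request_fingerprint`.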

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
kmike commented, Oct 4, 2017

@IAlwaysBeCoding I’m not sure I follow, sorry! The spider is available to the Scheduler in the from_crawler method, which is called when the Scheduler is created; a Scheduler subclass can store the spider as an attribute. But this doesn’t solve @Chratho’s problem, as he needs the spider in the dupefilter, not necessarily in the Scheduler.

Adding a spider argument to next_request is backwards incompatible and could break existing custom schedulers. It also doesn’t provide any new features, as this argument would be unused by default, and a Scheduler subclass can already access the spider by overriding the from_crawler method (as a bonus, access is not limited to the next_request method if the spider is saved in from_crawler).
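
A minimal sketch of the workaround kmike describes, assuming the default `scrapy.core.scheduler.Scheduler` as the base class:

```python
from scrapy.core.scheduler import Scheduler

class SpiderAwareScheduler(Scheduler):
    @classmethod
    def from_crawler(cls, crawler):
        scheduler = super().from_crawler(crawler)
        # The spider is already instantiated when the scheduler is built,
        # so it can simply be stored for later use in any method.
        scheduler.spider = crawler.spider
        return scheduler
```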

1 reaction
kmike commented, Oct 4, 2017

Another way to solve the problem is to pass a list of meta keys that should be considered in the uniqueness check; a custom dupefilter may look at this list and compute the fingerprint accordingly. See also: https://github.com/scrapy/scrapy/issues/900
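
A sketch of that idea, using a hypothetical `dupefilter_keys` meta convention (the key name is made up for illustration):

```python
import hashlib

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class KeyedDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        base = request_fingerprint(request)
        keys = request.meta.get('dupefilter_keys')
        if not keys:
            return base
        # Fold only the spider-declared meta values into the fingerprint.
        h = hashlib.sha1(base.encode())
        for key in sorted(keys):
            h.update(repr(request.meta.get(key)).encode())
        return h.hexdigest()
```

A spider would then declare uniqueness explicitly, e.g. `Request(url, meta={'dupefilter_keys': ['color'], 'color': 'red'})`.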

If we’re to make the spider available to the dupefilter, we should add from_crawler support to dupefilters in addition to from_settings; this is how it is solved for all other Scrapy components.
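
What that could look like, as a sketch (Scrapy did not call from_crawler on dupefilters at the time of this issue, so this assumes the scheduler is taught to prefer it):

```python
from scrapy.dupefilters import RFPDupeFilter

class SpiderAwareDupeFilter(RFPDupeFilter):
    @classmethod
    def from_crawler(cls, crawler):
        df = cls.from_settings(crawler.settings)
        # Keep a reference to the crawler; the running spider is then
        # reachable as self.crawler.spider from request_seen() etc.
        df.crawler = crawler
        return df
```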


Top Results From Across the Web

how to filter duplicate requests based on url in scrapy
I am writing a crawler for a website using scrapy with CrawlSpider. Scrapy provides an in-built duplicate-request filter which filters duplicate requests based …

Settings — Scrapy 2.7.1 documentation
Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the … Import path of a given asyncio event loop class.

Scrapy Documentation - Read the Docs
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide …

Scrapy 1.4.0 documentation
Opens the given URL in a browser, as your Scrapy spider would “see” it. … Spiders can access arguments in their __init__ methods: …
