[question] Why is `dont_filter=True` added for requests made from `start_urls`?
It’s observed that currently (as of b364d27) the requests generated in `scrapy.Spider.start_requests` have `dont_filter=True`. (related line of code: link)
From a quick look through the history, this behavior of adding `dont_filter=True` to initial requests seems to have been introduced in the very first commit, 83dcf8a (related line of code: link).
At least from my personal point of view, there seems to be no clear purpose for this specific behavior. Could anyone help explain the design and resolve my confusion? Or, if there is no such reason (or the previous reasons no longer hold), what about removing `dont_filter=True` from the initial requests?
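For context, the behavior being asked about can be sketched as follows. This is a simplified stand-in, not Scrapy's actual code (the real logic lives in `scrapy.Spider.start_requests`), with a toy `Request` class standing in for `scrapy.Request`:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Request:
    """Minimal stand-in for scrapy.Request, for illustration only."""
    url: str
    dont_filter: bool = False


class Spider:
    # Hypothetical start URLs for the sketch.
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def start_requests(self) -> Iterator[Request]:
        # Scrapy marks every initial request with dont_filter=True,
        # so the scheduler's duplicate filter never drops it.
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
```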
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 2
- Comments: 7 (5 by maintainers)
Top Results From Across the Web
How does adding dont_filter=True argument in scrapy ...
Short answer: You are making duplicate requests. Scrapy has built-in duplicate filtering which is turned on by default.
Requests and Responses — Scrapy 2.7.1 documentation
Both Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in ...
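The built-in duplicate filtering mentioned in the first result above can be illustrated with a minimal sketch. This is a toy stand-in for Scrapy's `RFPDupeFilter`, keyed on plain URLs instead of request fingerprints, together with the scheduler rule that `dont_filter` bypasses it:

```python
class DupeFilter:
    """Toy duplicate filter: reports whether a URL was already seen."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url: str) -> bool:
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


def schedule(requests, dupefilter):
    """Keep a request if dont_filter is set, or if it hasn't been seen yet.

    `requests` is a list of (url, dont_filter) pairs; returns the URLs
    that actually get scheduled.
    """
    scheduled = []
    for url, dont_filter in requests:
        if dont_filter or not dupefilter.request_seen(url):
            scheduled.append(url)
    return scheduled
```

With this, a repeated URL is dropped by default, but scheduled anyway when `dont_filter` is set — which is why duplicates among start requests are never filtered out.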
Top GitHub Comments
Just re-read my old comment 😃 The problem with this change is that if we remove `dont_filter=True` from `start_requests` and do nothing else, spiders like this would stop working:
(haven’t re-checked it, but all the related issues are still open)
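The spider example the comment points at was not captured in this page. One plausible failure mode — an illustrative assumption, not necessarily the maintainer's actual example — is that `dont_filter=True` also means the initial request is never *recorded* by the duplicate filter, so a callback can later re-yield the start URL (say, after getting past a login wall) and still have it scheduled. A toy simulation of that scheduling logic:

```python
class DupeFilter:
    """Toy stand-in for Scrapy's fingerprint-based duplicate filter,
    keyed on URL for simplicity."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url: str) -> bool:
        already = url in self.seen
        self.seen.add(url)
        return already


def crawl(start_dont_filter: bool) -> list:
    """Simulate a spider whose callback re-yields the start URL
    (hypothetical scenario, e.g. after authenticating)."""
    df = DupeFilter()
    start_url = "https://example.com/data"  # hypothetical URL
    scheduled = []

    # Initial request: with dont_filter=True the scheduler neither
    # consults nor records it in the duplicate filter.
    if start_dont_filter or not df.request_seen(start_url):
        scheduled.append(start_url)

    # Later, a callback yields the same URL again, without dont_filter.
    if not df.request_seen(start_url):
        scheduled.append(start_url)

    return scheduled
```

Under the current behavior (`start_dont_filter=True`) both fetches happen; if initial requests went through the filter, the revisit would be silently dropped.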
Also in favor of this, especially for URLs imported from a file, or a huge set of them with potential duplicates.
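Until the default changes, the imported-from-a-file use case has to deduplicate manually before yielding start requests. A plain-Python sketch (the file contents here are hypothetical):

```python
def unique_start_urls(lines):
    """Deduplicate imported URLs while preserving order — the work the
    duplicate filter would do if initial requests weren't exempt."""
    seen = set()
    out = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

A spider could then feed `unique_start_urls(open("urls.txt"))` into its `start_requests` override instead of relying on the scheduler's filter.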