Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[question] Why adding `dont_filter=True` for requests made from `start_urls`

See original GitHub issue

Currently (as of b364d27), the requests generated in `scrapy.Spider.start_requests` have `dont_filter=True`. (related line of code: link)

From a quick look through the history, the behavior of adding `dont_filter=True` to initial requests seems to have been introduced in the very first commit, 83dcf8a (related line of code: link).

At least from my personal perspective, there seems to be no clear purpose for this specific behavior.

Could anyone help explain the design and resolve my confusion? Or, if there is no such reason (or the original reasons no longer hold), what about removing the `dont_filter=True` attribute from initial requests?
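For readers unfamiliar with the flag: `dont_filter=True` tells Scrapy's scheduler to skip its duplicate-request filter for that particular request. A minimal, pure-Python sketch of the idea (this is an illustration, not Scrapy's actual `RFPDupeFilter`, which fingerprints the method, URL, and body together):

```python
import hashlib

class SimpleDupeFilter:
    """Toy duplicate filter: remembers a fingerprint per URL seen.
    A request flagged dont_filter=True bypasses the check entirely,
    which is what Scrapy does for requests built from start_urls."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url, dont_filter=False):
        if dont_filter:
            return False  # never treated as a duplicate
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.fingerprints:
            return True   # duplicate: would be dropped by the scheduler
        self.fingerprints.add(fp)
        return False

f = SimpleDupeFilter()
print(f.request_seen("https://example.com"))                    # False (first time)
print(f.request_seen("https://example.com"))                    # True (duplicate)
print(f.request_seen("https://example.com", dont_filter=True))  # False (bypassed)
```

So the question above amounts to: why should requests generated from `start_urls` take the bypass path by default?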

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

kmike commented, Feb 6, 2020

Just re-read my old comment 😃 The problem with this change is that if we remove `dont_filter=True` from `start_requests` and do nothing else, spiders like this would stop working:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = ['']

    def parse(self, response):
        # ... parse is never called!
        pass

(haven’t re-checked it, but all the related issues are still open)

MartiONE commented, Feb 6, 2020

Also in favor of this, especially for URLs imported from a file or for a huge set of them with potential duplicates.
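The imported-from-a-file case can also be handled without changing Scrapy at all. A hedged sketch (plain Python, not Scrapy API) that deduplicates a URL list while preserving order, which is effectively what the scheduler's dupefilter would do for start requests if they were not marked `dont_filter=True`:

```python
def dedupe_urls(urls):
    """Return the URLs in original order with exact duplicates removed."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

# Example with a small list; in practice urls might be read from a file.
print(dedupe_urls(["https://a.example", "https://b.example", "https://a.example"]))
# -> ['https://a.example', 'https://b.example']
```

Feeding the deduplicated list to `start_urls` avoids redundant requests today, regardless of how the issue is resolved.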

Read more comments on GitHub >

Top Results From Across the Web

How does adding dont_filter=True argument in scrapy ...
Short answer: You are making duplicate requests. Scrapy has built-in duplicate filtering which is turned on by default.
Read more >
Requests and Responses — Scrapy 2.7.1 documentation
Both Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in ...
Read more >
