[question] Why is `dont_filter=True` added for requests made from `start_urls`?
It’s observed that currently (as of b364d27) the requests generated in `scrapy.Spider.start_requests` have `dont_filter=True`. (related line of code: link)
From a quick look through the history, this behavior of adding `dont_filter=True` to initial requests seems to have been introduced in the very first commit, 83dcf8a (related line of code: link).
At least from my personal point of view, there seems to be no clear purpose for this specific behavior. Could anyone help explain the design and resolve my confusion? Or, if there is no such reason (or the previous reasons no longer hold), what about removing `dont_filter=True` from the initial requests?
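For context, the behavior being asked about can be sketched as follows. This is a simplified stand-in, not Scrapy's actual code (the real logic lives in `scrapy.Spider.start_requests`), with a toy `Request` class standing in for `scrapy.Request`:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Request:
    """Minimal stand-in for scrapy.Request, for illustration only."""
    url: str
    dont_filter: bool = False


class Spider:
    # Hypothetical start URLs for the sketch.
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def start_requests(self) -> Iterator[Request]:
        # Scrapy marks every initial request with dont_filter=True,
        # so the scheduler's duplicate filter never drops it.
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
```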
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 2
- Comments: 7 (5 by maintainers)
Top Results From Across the Web
How does adding dont_filter=True argument in scrapy ...
Short answer: You are making duplicate requests. Scrapy has built-in duplicate filtering which is turned on by default.
Requests and Responses — Scrapy 2.7.1 documentation
Both Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in ...
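The built-in duplicate filtering mentioned in the first result above can be illustrated with a minimal sketch. This is a toy stand-in for Scrapy's `RFPDupeFilter`, keyed on plain URLs instead of request fingerprints, together with the scheduler rule that `dont_filter` bypasses it:

```python
class DupeFilter:
    """Toy duplicate filter: reports whether a URL was already seen."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url: str) -> bool:
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


def schedule(requests, dupefilter):
    """Keep a request if dont_filter is set, or if it hasn't been seen yet.

    `requests` is a list of (url, dont_filter) pairs; returns the URLs
    that actually get scheduled.
    """
    scheduled = []
    for url, dont_filter in requests:
        if dont_filter or not dupefilter.request_seen(url):
            scheduled.append(url)
    return scheduled
```

With this, a repeated URL is dropped by default, but scheduled anyway when `dont_filter` is set — which is why duplicates among start requests are never filtered out.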
Top GitHub Comments
Just re-read my old comment 😃 The problem with this change is that if we remove `dont_filter=True` from `start_requests` and do nothing else, spiders like this would stop working:
(haven’t re-checked it, but all the related issues are still open)
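The spider example the comment points at was not captured in this page. One plausible failure mode — an illustrative assumption, not necessarily the maintainer's actual example — is that `dont_filter=True` also means the initial request is never *recorded* by the duplicate filter, so a callback can later re-yield the start URL (say, after getting past a login wall) and still have it scheduled. A toy simulation of that scheduling logic:

```python
class DupeFilter:
    """Toy stand-in for Scrapy's fingerprint-based duplicate filter,
    keyed on URL for simplicity."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url: str) -> bool:
        already = url in self.seen
        self.seen.add(url)
        return already


def crawl(start_dont_filter: bool) -> list:
    """Simulate a spider whose callback re-yields the start URL
    (hypothetical scenario, e.g. after authenticating)."""
    df = DupeFilter()
    start_url = "https://example.com/data"  # hypothetical URL
    scheduled = []

    # Initial request: with dont_filter=True the scheduler neither
    # consults nor records it in the duplicate filter.
    if start_dont_filter or not df.request_seen(start_url):
        scheduled.append(start_url)

    # Later, a callback yields the same URL again, without dont_filter.
    if not df.request_seen(start_url):
        scheduled.append(start_url)

    return scheduled
```

Under the current behavior (`start_dont_filter=True`) both fetches happen; if initial requests went through the filter, the revisit would be silently dropped.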
Also in favor of this, especially for URLs imported from a file, or a huge set of them with potential duplicates.
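Until the default changes, the imported-from-a-file use case has to deduplicate manually before yielding start requests. A plain-Python sketch (the file contents here are hypothetical):

```python
def unique_start_urls(lines):
    """Deduplicate imported URLs while preserving order — the work the
    duplicate filter would do if initial requests weren't exempt."""
    seen = set()
    out = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

A spider could then feed `unique_start_urls(open("urls.txt"))` into its `start_requests` override instead of relying on the scheduler's filter.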