dupefilter skips a request when a page is redirected to itself
Hit this when trying to run a spider against scrapinghub.com: sometimes it responds with a 302 redirect back to scrapinghub.com itself. The scheduler obliges and tries to schedule another request for scrapinghub.com, but fails because the dupefilter already considers it visited.
Maybe the dupefilter should only record a request when the response is not a redirect? And when it is, the scheduler should probably remember the original address too, so that the whole redirection chain can be marked as visited.
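The failure mode can be reproduced with a toy model of a fingerprint dupefilter (the names here are illustrative; Scrapy's real filter is `RFPDupeFilter`, which hashes the canonicalized request, not just the raw URL):

```python
import hashlib

seen = set()

def request_seen(method: str, url: str) -> bool:
    """Toy fingerprint dupefilter: returns True if an equivalent
    request was already scheduled."""
    fp = hashlib.sha1(f"{method} {url}".encode()).hexdigest()
    if fp in seen:
        return True
    seen.add(fp)
    return False

# The original request is scheduled and fingerprinted.
assert request_seen("GET", "https://scrapinghub.com") is False
# The server answers 302 with Location: https://scrapinghub.com;
# the redirected request has the same fingerprint, so it is dropped
# and the redirect chain never completes.
assert request_seen("GET", "https://scrapinghub.com") is True
```

This is why the spider stalls: the redirected request is indistinguishable from the one already recorded.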
Issue Analytics
- Created 8 years ago
- Reactions: 4
- Comments: 20 (13 by maintainers)
Top Results From Across the Web
Scrapy dupefilter filtering redirects even with dont_filter=True
"I'm trying to scrape a page that redirects me a few times to itself (bouncing between http and https) before finally responding."
Redirections in HTTP - MDN Web Docs - Mozilla
"In HTTP, redirection is triggered by a server sending a special redirect response to a request. Redirect responses have status codes that start …"
The Ultimate Guide to Redirects: URL ...
"The user's browser requests the old (redirected) URL. The server automatically displays the webpage for the new URL (the redirect target)."
ERR_TOO_MANY_REDIRECTS · Cloudflare SSL/TLS docs
"This error occurs when visitors get stuck in a redirect loop. … if your origin server automatically redirects all HTTP requests to HTTPS."
Release notes — Scrapy 0.24 documentation
"Obey request method when scrapy deploy is redirected to a new endpoint (commit 8c4fcee) … Added SitemapSpider (see documentation in Spiders page) (r2658)."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I’ve faced this problem today while writing a spider like this: it sends a FormRequest to website.com/login for each credential, with dont_filter=True.
Workaround:
Since this spider was very simple, I’ve just disabled my dupefilter with:
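The setting itself was elided from the comment; one common way to disable request deduplication in Scrapy (not necessarily what the commenter used) is to swap in the no-op `BaseDupeFilter` in the project settings:

```python
# settings.py - disable request deduplication entirely by using the
# no-op BaseDupeFilter, whose request_seen() always returns False.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"
```

Note this turns off deduplication for every request, not just redirected ones, so it is only reasonable for small, bounded crawls like the one described above.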
But it would be very interesting to have a way to disable this behavior, like what’s being proposed in #4314, or something like dont_filter_redirects=True.
Surely I can, and I have done so. But I think it is a bug and you should fix it. Why hasn’t it been solved for almost 3 years?
I know that in some situations an anti-spider system will always redirect the spider to one page to protect its data, so this duplicate filter is reasonable. But is there a smarter way to solve my problem?
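One smarter approach, in the spirit of the issue's own suggestion, is a dupefilter that never drops a request produced by following a redirect, while still fingerprinting it. In real Scrapy, RedirectMiddleware records the chain in `meta['redirect_urls']`; the class below is a hypothetical stand-in written with only the standard library, not Scrapy's actual `RFPDupeFilter` API:

```python
import hashlib

def fingerprint(method: str, url: str) -> str:
    # Simplified stand-in for Scrapy's request fingerprint.
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

class RedirectAwareDupeFilter:
    """Hypothetical dupefilter: a request carrying a redirect chain
    (meta['redirect_urls']) is always allowed through, but is still
    recorded so later fresh requests to the same URL get filtered."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url, meta=None):
        meta = meta or {}
        fp = fingerprint(method, url)
        if meta.get("redirect_urls"):
            # Came from a redirect: let it through, remember it.
            self.seen.add(fp)
            return False
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False

df = RedirectAwareDupeFilter()
# First request to the page: scheduled normally.
assert df.request_seen("GET", "https://scrapinghub.com") is False
# Server 302s back to the same URL; the re-issued request carries
# the redirect chain in meta, so it is NOT filtered.
assert df.request_seen(
    "GET", "https://scrapinghub.com",
    meta={"redirect_urls": ["https://scrapinghub.com"]},
) is False
```

This keeps deduplication on for ordinary requests while letting self-redirects complete, which is roughly what a `dont_filter_redirects=True` option would do.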