
dupefilter skips a request when a page is redirected to itself


Hit this when trying to run a spider against scrapinghub.com: sometimes the site responds with a 302 redirect pointing back to scrapinghub.com itself. Scrapy then tries to schedule another request for scrapinghub.com, but the request is dropped because the dupefilter already considers that URL visited.

Maybe the dupefilter should only record a URL when the response is not a redirect? And when it is, the scheduler should probably remember the original address too, so that the whole redirection chain can be marked as visited.
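
A minimal sketch of why the two requests collide, assuming the default RFPDupeFilter and the request_fingerprint helper (newer Scrapy versions expose scrapy.utils.request.fingerprint instead):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# The 302 points back at the same URL; RedirectMiddleware builds the
# follow-up request from the original one.
original = Request("https://scrapinghub.com")
redirected = original.replace(url="https://scrapinghub.com")

# Same URL, same fingerprint: the dupefilter drops the redirected request.
assert request_fingerprint(original) == request_fingerprint(redirected)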

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Reactions: 4
  • Comments: 20 (13 by maintainers)

Top GitHub Comments

2 reactions
victor-torres commented, Aug 14, 2020

I’ve faced this problem today while writing a spider like this (sketched in code right after the list):

  • spider has a list of credentials
  • submit a FormRequest to website.com/login for each credential with dont_filter=True
  • requests are redirected from website.com/login to website.com/profile
  • only one profile is fetched because of the default dupefilter
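
A minimal sketch of such a spider; website.com, the form field names, and the credentials are placeholders:

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"
    # Hypothetical credential list.
    credentials = [("alice", "secret1"), ("bob", "secret2")]

    def start_requests(self):
        for username, password in self.credentials:
            # dont_filter=True lets every login POST through, but the
            # server answers each one with a redirect to /profile, and
            # the dupefilter drops every redirected request after the
            # first because they all share the same URL.
            yield scrapy.FormRequest(
                "https://website.com/login",
                formdata={"username": username, "password": password},
                dont_filter=True,
                callback=self.parse_profile,
            )

    def parse_profile(self, response):
        ...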

Workaround:

Since this spider was very simple, I just disabled the dupefilter with:

import scrapy

class WebsiteSpider(scrapy.Spider):

    ...

    # BaseDupeFilter keeps no fingerprint state and never filters,
    # so duplicate requests (including redirect targets) go through.
    custom_settings = {
        "DUPEFILTER_CLASS": "scrapy.dupefilters.BaseDupeFilter",
    }

    ...

But it would be very useful to have a way to disable this behavior, like what’s being proposed in #4314, or something like dont_filter_redirects=True.
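
In the meantime, one way to keep the dupefilter on while letting redirected requests through is to subclass RFPDupeFilter and whitelist requests that RedirectMiddleware produced (it records the chain in request.meta["redirect_urls"]). This is just a sketch of the idea, not the change proposed in #4314:

from scrapy.dupefilters import RFPDupeFilter

class SkipRedirectedDupeFilter(RFPDupeFilter):

    def request_seen(self, request):
        # RedirectMiddleware stores the list of prior URLs here, so its
        # presence means this request came from following a redirect.
        if request.meta.get("redirect_urls"):
            return False  # never filter a redirected request
        return super().request_seen(request)

Point DUPEFILTER_CLASS at this class, as in the snippet above, to enable it.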

1 reaction
kingname commented, Oct 26, 2017

Sure, I can do that, and I have. But I think this is a bug that should be fixed. Why hasn’t it been solved in almost 3 years?

I know that in some situations an anti-spider system will always redirect the crawler to a single page to protect its data, so this duplicate filter is reasonable. But is there a smarter way to solve my problem?
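
For context, the usual per-request escape hatch (and presumably what the commenter was told to use) is dont_filter=True, which tells the scheduler not to run that one request through the dupefilter. Note that, per the Stack Overflow thread linked below, the flag has not always carried over to the request created by a redirect:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # dont_filter=True bypasses the dupefilter for this request only,
        # even though response.url has already been visited.
        yield scrapy.Request(response.url, callback=self.parse_again,
                             dont_filter=True)

    def parse_again(self, response):
        ...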

Read more comments on GitHub.

Top Results From Across the Web

  • Scrapy dupefilter filtering redirects even with dont_filter=True
    I'm trying to scrape a page that redirects me a few times to itself (bouncing between http and https) before finally responding.
  • Redirections in HTTP - MDN Web Docs - Mozilla
    In HTTP, redirection is triggered by a server sending a special redirect response to a request. Redirect responses have status codes that start…
  • The Ultimate Guide to Redirects: URL ...
    The user's browser requests the old (redirected) URL. The server automatically displays the webpage for the new URL (the redirect target).
  • ERR_TOO_MANY_REDIRECTS · Cloudflare SSL/TLS docs
    This error occurs when visitors get stuck in a redirect loop. … if your origin server automatically redirects all HTTP requests to HTTPS…
  • Release notes — Scrapy 0.24 documentation - Script Home (脚本之家) online manual
    obey request method when scrapy deploy is redirected to a new endpoint (commit 8c4fcee) … Added SitemapSpider (see documentation in Spiders page) (r2658)…
