dupefilter skips a request when a page is redirected to itself
Hit this when trying to run a spider against scrapinghub.com: sometimes it responds with a 302 redirect back to scrapinghub.com itself. The scheduler obliges and tries to schedule another request for scrapinghub.com, but fails because the dupefilter already considers it visited.
Maybe the dupefilter should only record a request when the response is not a redirect? And when it is, the scheduler should probably remember the original address too, so that the whole redirection chain can be marked as visited.
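The failure mode can be reproduced with a toy model of a fingerprint dupefilter (the names here are illustrative; Scrapy's real filter is `RFPDupeFilter`, which hashes the canonicalized request, not just the raw URL):

```python
import hashlib

seen = set()

def request_seen(method: str, url: str) -> bool:
    """Toy fingerprint dupefilter: returns True if an equivalent
    request was already scheduled."""
    fp = hashlib.sha1(f"{method} {url}".encode()).hexdigest()
    if fp in seen:
        return True
    seen.add(fp)
    return False

# The original request is scheduled and fingerprinted.
assert request_seen("GET", "https://scrapinghub.com") is False
# The server answers 302 with Location: https://scrapinghub.com;
# the redirected request has the same fingerprint, so it is dropped
# and the redirect chain never completes.
assert request_seen("GET", "https://scrapinghub.com") is True
```

This is why the spider stalls: the redirected request is indistinguishable from the one already recorded.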
Issue Analytics
- Created 8 years ago
- Reactions: 4
- Comments: 20 (13 by maintainers)
Top Results From Across the Web
Scrapy dupefilter filtering redirects even with dont_filter=True
"I'm trying to scrape a page that redirects me a few times to itself (bouncing between http and https) before finally responding."
Redirections in HTTP - MDN Web Docs - Mozilla
"In HTTP, redirection is triggered by a server sending a special redirect response to a request. Redirect responses have status codes that start …"
The Ultimate Guide to Redirects: URL ...
"The user's browser requests the old (redirected) URL. The server automatically displays the webpage for the new URL (the redirect target)."
ERR_TOO_MANY_REDIRECTS · Cloudflare SSL/TLS docs
"This error occurs when visitors get stuck in a redirect loop. … if your origin server automatically redirects all HTTP requests to HTTPS."
Release notes — Scrapy 0.24 documentation
"Obey request method when scrapy deploy is redirected to a new endpoint (commit 8c4fcee) … Added SitemapSpider (see documentation in Spiders page) (r2658)."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I’ve faced this problem today while writing a spider like this: it sends a FormRequest to website.com/login for each credential, with dont_filter=True.
Workaround:
Since this spider was very simple, I’ve just disabled my dupefilter with:
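The setting itself was elided from the comment; one common way to disable request deduplication in Scrapy (not necessarily what the commenter used) is to swap in the no-op `BaseDupeFilter` in the project settings:

```python
# settings.py - disable request deduplication entirely by using the
# no-op BaseDupeFilter, whose request_seen() always returns False.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"
```

Note this turns off deduplication for every request, not just redirected ones, so it is only reasonable for small, bounded crawls like the one described above.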
But it would be very interesting to have a way to disable this behavior, like what’s being proposed in #4314, or something like dont_filter_redirects=True.
Surely I can, and I have done so. But I think it is a bug and you should fix it. Why hasn’t it been solved for almost 3 years?
I know that in some situations an anti-spider system will always redirect the spider to one page to protect its data, so this duplicate filter is reasonable. But is there a smarter way to solve my problem?
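One smarter approach, in the spirit of the issue's own suggestion, is a dupefilter that never drops a request produced by following a redirect, while still fingerprinting it. In real Scrapy, RedirectMiddleware records the chain in `meta['redirect_urls']`; the class below is a hypothetical stand-in written with only the standard library, not Scrapy's actual `RFPDupeFilter` API:

```python
import hashlib

def fingerprint(method: str, url: str) -> str:
    # Simplified stand-in for Scrapy's request fingerprint.
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

class RedirectAwareDupeFilter:
    """Hypothetical dupefilter: a request carrying a redirect chain
    (meta['redirect_urls']) is always allowed through, but is still
    recorded so later fresh requests to the same URL get filtered."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url, meta=None):
        meta = meta or {}
        fp = fingerprint(method, url)
        if meta.get("redirect_urls"):
            # Came from a redirect: let it through, remember it.
            self.seen.add(fp)
            return False
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False

df = RedirectAwareDupeFilter()
# First request to the page: scheduled normally.
assert df.request_seen("GET", "https://scrapinghub.com") is False
# Server 302s back to the same URL; the re-issued request carries
# the redirect chain in meta, so it is NOT filtered.
assert df.request_seen(
    "GET", "https://scrapinghub.com",
    meta={"redirect_urls": ["https://scrapinghub.com"]},
) is False
```

This keeps deduplication on for ordinary requests while letting self-redirects complete, which is roughly what a `dont_filter_redirects=True` option would do.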