allowed_domains bug/undesired behaviour
See the original GitHub issue.

Assume the crawler has set allowed_domains to the list below:
self.allowed_domains = ['albert.zgora.pl']
Scrapy shouldn’t go beyond the ‘albert.zgora.pl’ domain.
This is just one real-life example, but there are many more (and I can give them here if you want) where the domain string appears somewhere in the URL, e.g. in a &url= parameter.
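To make this concrete, here is a minimal sketch (the off-site URL below is hypothetical, not taken from the issue) contrasting a naive substring check with hostname-based matching, which is what allowed_domains is meant to express:

from urllib.parse import urlparse

allowed = ['albert.zgora.pl']
# Hypothetical off-site URL that merely mentions the allowed domain:
url = 'http://tracker.example/click?url=albert.zgora.pl'

# Naive substring test -- wrongly accepts the off-site URL:
print(any(d in url for d in allowed))  # True

# Hostname-based test -- correctly rejects it:
host = urlparse(url).hostname
print(any(host == d or host.endswith('.' + d) for d in allowed))  # False

The substring test passes only because the allowed domain happens to appear in the query string; matching against the parsed hostname avoids that trap.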
Issue Analytics
- Created 5 years ago
- Reactions: 2
- Comments: 7 (4 by maintainers)
Top GitHub Comments
see also: https://github.com/scrapy/scrapy/issues/2241
I think this problem usually happens due to RedirectMiddleware. Full repro:
Actual:
Expected:
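A minimal sketch of the setup being described, assuming a page on the allowed domain that redirects to a host outside allowed_domains (the path and redirect target are hypothetical):

import scrapy

class RedirectDemoSpider(scrapy.Spider):
    name = 'redirect_demo'
    allowed_domains = ['albert.zgora.pl']
    # Hypothetical: this page 302-redirects to an off-site host.
    start_urls = ['http://albert.zgora.pl/goes-offsite']

    def parse(self, response):
        # With the behaviour described below, this callback can receive a
        # response that was downloaded from the off-site redirect target,
        # even though its host is not in allowed_domains.
        self.logger.info('Downloaded %s', response.url)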
I thought, like in the first post, that it’s due to the domain string appearing elsewhere in the URL (because that also happens to be true in my case), but it’s just a coincidence. I’m quite sure that the logic is sound.
The real issue probably is that the check of whether the download is allowed happens without any involvement of RedirectMiddleware, but the actual download is of the redirected URL.
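One possible workaround, sketched here under the assumption that re-checking every outgoing request is acceptable (this middleware is hypothetical, not part of Scrapy), is a downloader middleware that re-applies the allowed_domains check; requests created by RedirectMiddleware pass through the downloader middleware chain again before being fetched, so they are covered too:

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

class OffsiteRedirectGuard:
    """Drop any request whose host falls outside the spider's allowed_domains."""

    def process_request(self, request, spider):
        allowed = getattr(spider, 'allowed_domains', None) or []
        host = urlparse(request.url).hostname or ''
        if allowed and not any(
            host == d or host.endswith('.' + d) for d in allowed
        ):
            raise IgnoreRequest(f'off-site request dropped: {request.url}')

Enabling it in the project settings (the module path is hypothetical) is enough:

DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.OffsiteRedirectGuard': 543}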