Offsite middleware doesn't filter redirected responses
Reported by fencer on Trac: http://dev.scrapy.org/ticket/100
I am using a BaseSpider to harvest links. The spider evaluates every anchor link on the page, processes it, and applies an algorithm to it. The spider’s parse function returns both items for output and requests with the harvested links for further crawling.
No extra domains were specified; only the domain_name value was set in the spider, to “agd.org”. While testing the spider, I noticed it was crawling URLs outside the domain_name.
In examining the log file, I noticed that there were 302 redirects from a URL inside the domain to a URL outside the domain. Every domain crawled outside of the original domain_name correlated with a 302 redirect.
2009-09-01 12:44:25-0700 [agd.org] DEBUG: Redirecting (302) to
<http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1>
from <http://www.agd.org/adtracking/a.aspx?ZoneID=18&Task=Click&Mode=HTML&SiteID=1&PageID=28659>
I have not examined the SpiderMiddleware in detail, but I am guessing that the 302 redirect is somehow circumventing scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware.
I am not sure whether this is a bug or the way 302 redirects were intentionally designed to be handled.
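To make the expected behaviour concrete, here is a minimal sketch of the check I would expect to be applied to a redirect target before it is followed. It is plain Python and does not use Scrapy; the function names is_offsite and filter_redirect are hypothetical, and this is an assumption about the desired behaviour, not how OffsiteMiddleware is actually implemented.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is neither an allowed domain nor a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_redirect(location, allowed_domains):
    """Return the redirect target if it stays on-site, otherwise None (hypothetical helper)."""
    return None if is_offsite(location, allowed_domains) else location


# The redirect from the log above would be dropped, since goarmy.com is not under agd.org.
target = "http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1"
print(filter_redirect(target, ["agd.org"]))
```

The point is only that the same domain test used for links extracted from a page would also need to run on the Location of a redirect.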
Top GitHub Comments
I must add that this problem is not just with allowed_domains; it also affects deny rules in CrawlSpider, e.g.
rules = (Rule(LinkExtractor(deny="never_get"), follow=True),)
A redirected link can be crawled even when it matches the deny regexp.

Sorry for the late response; here is the pull request. Happy New Year everybody!
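For the deny-rule case mentioned in the comment above, a rough sketch of the problem in plain Python follows. The helper names and the example.com URLs are made up for illustration, and this does not reflect the contents of the pull request or Scrapy's LinkExtractor internals. The deny pattern is checked against the extracted link only, so a redirect target that matches it slips through unless the check is repeated on every URL in the chain.

```python
import re


def passes_deny(url, deny_patterns):
    """Return True if the URL matches none of the deny regexps."""
    return not any(re.search(p, url) for p in deny_patterns)


def first_denied(url_chain, deny_patterns):
    """Return the first URL in a redirect chain that matches a deny regexp, or None."""
    for url in url_chain:
        if not passes_deny(url, deny_patterns):
            return url
    return None


deny = ["never_get"]
# Hypothetical chain: the extracted link is fine, but it redirects to a denied URL.
chain = ["http://example.com/go?id=1", "http://example.com/never_get/page"]

print(passes_deny(chain[0], deny))  # only the extracted link is checked
print(first_denied(chain, deny))    # checking the whole chain catches the redirect target
```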