Offsite middleware doesn't filter redirected responses

See original GitHub issue

Reported by fencer on Trac: http://dev.scrapy.org/ticket/100

I am using a BaseSpider to harvest links. The spider evaluates every anchor link on the page, processes each one, and applies an algorithm to it. The spider’s parse function returns both items for output and requests for the harvested links, for further crawling.

No extra domains were specified; only the domain_name value was set in the spider, to “agd.org”. While testing the spider, I noticed it was crawling URLs outside the domain_name.

Examining the log file, I noticed 302 redirects from a URL inside the domain to a URL outside the domain. Every domain crawled outside the original domain_name correlated with a 302 redirect.

2009-09-01 12:44:25-0700 [agd.org] DEBUG: Redirecting (302) to 
   <http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1> 
   from <http://www.agd.org/adtracking/a.aspx?ZoneID=18&Task=Click&Mode=HTML&SiteID=1&PageID=28659>

I have not examined the SpiderMiddleware in detail, but I am guessing that the 302 redirect is somehow circumventing the scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware.

I am not sure whether this is a bug or the way 302 redirects were intentionally designed to be handled.
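
To make the report concrete, here is a minimal standard-library sketch of a domain check, applied to the two URLs from the log excerpt above. It is not Scrapy's implementation: the `is_offsite` helper and the `allowed` list are hypothetical names used only for illustration. If, as the reporter guesses, the offsite check only sees requests returned by the spider, it only ever sees the first URL, and the redirect target is never tested.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is neither an allowed domain nor a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# Hypothetical stand-in for the spider's domain_name setting from the report.
allowed = ["agd.org"]

# URLs copied from the log excerpt in the report.
source = "http://www.agd.org/adtracking/a.aspx?ZoneID=18&Task=Click&Mode=HTML&SiteID=1&PageID=28659"
target = "http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1"

# The request the spider returns is on-site, so a check at that stage lets it through.
source_offsite = is_offsite(source, allowed)

# The redirect target is off-site, but it is only caught if the check is re-applied to it.
target_offsite = is_offsite(target, allowed)

print(source_offsite, target_offsite)
```

Under this assumption, the fix is to run the same check again on the URL taken from the redirect's Location header before following it.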

Issue Analytics

  • State: closed
  • Created 12 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

djunzu commented, Oct 3, 2016 (1 reaction)

I must add that this problem is not limited to allowed_domains; it also affects deny rules in CrawlSpider.

For example, with rules = (Rule(LinkExtractor(deny="never_get"), follow=True),), a redirected link can still be crawled even when its URL matches the deny regexp.
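
The same gap can be sketched for deny patterns. The snippet below uses only the standard library and hypothetical example URLs. `first_denied` is an illustrative helper, not part of Scrapy, and it assumes the deny pattern is matched with regular-expression search semantics. It checks every URL in a redirect chain against the deny patterns, which a check done only at link-extraction time does not do.

```python
import re


def first_denied(url_chain, deny_patterns):
    """Return the first URL in the chain that matches any deny pattern, or None."""
    compiled = [re.compile(pattern) for pattern in deny_patterns]
    for url in url_chain:
        if any(regex.search(url) for regex in compiled):
            return url
    return None


deny = ["never_get"]  # the deny pattern from the comment above

# Hypothetical chain: the extracted link, followed by the URL it redirects to.
chain = [
    "http://example.com/landing",
    "http://example.com/never_get/page",
]

# Checking only the extracted link (the first element) finds nothing to deny.
extracted_only = first_denied(chain[:1], deny)

# Checking the whole chain catches the redirect target.
whole_chain = first_denied(chain, deny)

print(extracted_only, whole_chain)
```

The design choice is to filter at the point where the redirect is followed rather than where links are extracted, so that both the original and the final URL are covered.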

aliowka commented, Jan 5, 2015 (0 reactions)

Sorry for the late response; here is the pull request. Happy New Year, everybody!


