Offsite middleware doesn't filter redirected responses

Reported by fencer on Trac: http://dev.scrapy.org/ticket/100

I am using a BaseSpider to harvest links. The spider evaluates every anchor link on the page, processes each one, and applies an algorithm to it. The spider’s parse function returns both items for output and requests with the harvested links for further crawling.

No extra domains were specified; only the domain_name value was set in the spider, to “agd.org”. While testing the spider, I noticed it was crawling URLs outside the domain_name.

Examining the log file, I noticed 302 redirects from a URL inside the domain to a URL outside it. Every domain crawled outside the original domain_name correlated with a 302 redirect.

2009-09-01 12:44:25-0700 [agd.org] DEBUG: Redirecting (302) to 
   <http://www.goarmy.com/amedd/dental/index.jsp?iom=9618-ITBP-MCDE-07012009-16-09021-180AD1> 
   from <http://www.agd.org/adtracking/a.aspx?ZoneID=18&Task=Click&Mode=HTML&SiteID=1&PageID=28659>

I have not examined the SpiderMiddleware in detail, but I am guessing that the 302 redirect is somehow circumventing scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware.

I am not sure whether this is a bug or the way it was intentionally designed to handle 302 redirects.
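
To make the suspected gap concrete, here is a minimal sketch of the kind of domain check that would also need to run on a redirect target. It is not Scrapy's implementation: the helper names is_offsite and redirect_target are hypothetical, and the code uses only the standard library. The URLs are shortened forms of the ones in the log above.

```python
from urllib.parse import urljoin, urlparse


def is_offsite(url, allowed_domains):
    """Return True when the URL's host is neither an allowed domain nor a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    allowed = [domain.lower() for domain in allowed_domains]
    # Note: with an empty allowed list this sketch treats every URL as offsite.
    return not any(host == d or host.endswith("." + d) for d in allowed)


def redirect_target(request_url, location, allowed_domains):
    """Resolve a Location header against the request URL; return None if the target is offsite."""
    target = urljoin(request_url, location)
    return None if is_offsite(target, allowed_domains) else target


# The request itself is on-site, so a check made only when the link is extracted passes.
on_site = is_offsite("http://www.agd.org/adtracking/a.aspx", ["agd.org"])

# The 302 target is off-site; it is only caught if the check is repeated on the redirect.
followed = redirect_target(
    "http://www.agd.org/adtracking/a.aspx",
    "http://www.goarmy.com/amedd/dental/index.jsp",
    ["agd.org"],
)
```

In this sketch the on-site request passes the check while the redirect target is rejected, which is the behavior the report expected from the crawl.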

Issue Analytics

State:
Created 12 years ago
Comments: 8 (7 by maintainers)

Top GitHub Comments

djunzu commented, Oct 3, 2016

I must add that this problem is not limited to allowed_domains; it also affects deny rules in CrawlSpider.

For example, with rules = (Rule(LinkExtractor(deny="never_get"), follow=True),), a redirected link can still be crawled even when its URL matches the deny regexp.
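
A rough illustration of that point, assuming the deny pattern is applied as a regular-expression search on the URL at link-extraction time only. The helper passes_deny and both example.com URLs are hypothetical, and nothing here calls Scrapy itself.

```python
import re

# The deny pattern from the rule above.
deny = re.compile(r"never_get")


def passes_deny(url):
    """Return True when the URL does not match the deny pattern."""
    return deny.search(url) is None


# Hypothetical link found on a page, and the URL it redirects to.
extracted_link = "http://example.com/start"
redirect_location = "http://example.com/never_get/page"

# The extracted link is the only URL the rule sees, and it passes.
checked_at_extraction = passes_deny(extracted_link)

# The redirect target would fail the same check, but the check is never repeated.
checked_after_redirect = passes_deny(redirect_location)
```

The two results differ, which is why a redirect can lead the crawler to a URL the rule was meant to exclude.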

aliowka commented, Jan 5, 2015

Sorry for the late response; here is the pull request. Happy New Year, everybody!
