scrapy.Request no init error on invalid url
I stumbled on some weird issue: my spider received an invalid URL, but instead of crashing loudly when creating a scrapy.Request() with that invalid URL, Scrapy silently ignored the error. Sample to reproduce:
from scrapy.spiders import Spider
from scrapy import Request


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        invalid_url = "/container.productlist.productslist.productthumbnail.articledetaillink.layerlink:open-layer/0/CLASSIC/-1/WEB$007cARBO$007c13263065/null$007cDisplay$0020Product$002f111499$002fAil$0020blanc$007c?t:ac=13263065"
        yield Request(invalid_url)
This generates the following output:
2017-02-09 12:21:04 [scrapy.core.engine] INFO: Spider opened
2017-02-09 12:21:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-09 12:21:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-09 12:21:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2017-02-09 12:21:04 [scrapy.core.engine] INFO: Closing spider (finished)
There is no information about the attempt to create this Request with invalid_url: no stack trace, no error message from any middleware. Why?
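One plausible explanation, sketched below with only the standard library: the URL above contains a ":" inside its path, so any validation that merely looks for a colon anywhere in the string would accept it, even though urllib.parse finds no real scheme. (This is an illustration of the failure mode, not Scrapy's actual code.)

```python
from urllib.parse import urlparse

# A shortened copy of the invalid URL from the spider above; the ':' sits
# in the middle of the path, not in a scheme position.
invalid_url = ("/container.productlist.productslist.productthumbnail."
               "articledetaillink.layerlink:open-layer/0/CLASSIC/-1/"
               "WEB$007cARBO$007c13263065?t:ac=13263065")

# Naive check: any ':' anywhere counts as "has a scheme" -> accepted.
print(':' in invalid_url)            # True

# Stricter check: parse the URL and inspect the scheme component.
print(urlparse(invalid_url).scheme)  # -> '' (no scheme at all)
```

The first character of the string is "/", so no conforming URL parser will ever treat the later colon as a scheme delimiter.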
Issue Analytics
- State:
- Created 7 years ago
- Comments: 12 (9 by maintainers)
Top GitHub Comments
I think this check https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/__init__.py#L56 is not good: every URL that contains “:” passes through it, including the invalid_url above, whose “:” appears in the middle of the path rather than in a scheme.
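A stricter check could parse the URL and require a non-empty scheme component instead of just searching for a colon. This is only a sketch of the idea, not the patch from the PR; `has_valid_scheme` is a hypothetical helper name:

```python
from urllib.parse import urlparse

def has_valid_scheme(url):
    """Return True only if the URL begins with a real scheme (http, data, ...)."""
    return bool(urlparse(url).scheme)

print(has_valid_scheme("http://example.com/"))   # True
print(has_valid_scheme("data:text/plain,hi"))    # True  (data: URIs still pass)
print(has_valid_scheme("/path/with:colon"))      # False (colon is in the path)
```

Raising ValueError when this returns False would make the spider fail loudly instead of dropping the request silently.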
@hsumerf I believe data: URIs are used to download embedded images and other types of files found in the page source code, for example via the Media Pipeline (Files and Images). I took the opportunity to open a PR proposing those changes and adding a simple regression test case.
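As an aside, data: URIs really do carry a fetchable resource inline; the standard library's urlopen resolves them without any network access, which is a quick way to sanity-check that a stricter validator must keep accepting them. A sketch using only the standard library, not Scrapy code:

```python
from urllib.request import urlopen

# An inline resource: the payload is percent-encoded in the URI itself.
data_uri = "data:text/plain;charset=utf-8,hello%20world"

# urlopen handles the data: scheme directly; no network request is made.
with urlopen(data_uri) as resp:
    print(resp.read().decode("utf-8"))  # hello world
```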