
scrapy.Request no init error on invalid url

See original GitHub issue

I stumbled on a weird issue: my spider received an invalid URL, but instead of failing loudly when creating a scrapy.Request() with that URL, Scrapy silently ignored the error. Sample to reproduce:

from scrapy.spiders import Spider
from scrapy import Request


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        invalid_url = "/container.productlist.productslist.productthumbnail.articledetaillink.layerlink:open-layer/0/CLASSIC/-1/WEB$007cARBO$007c13263065/null$007cDisplay$0020Product$002f111499$002fAil$0020blanc$007c?t:ac=13263065"
        yield Request(invalid_url)

This generates the following output:

2017-02-09 12:21:04 [scrapy.core.engine] INFO: Spider opened
2017-02-09 12:21:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-09 12:21:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-09 12:21:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2017-02-09 12:21:04 [scrapy.core.engine] INFO: Closing spider (finished)

There is no information about the attempt to generate this Request with invalid_url: no stack trace, no error, no message from any middleware. Why?
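One detail worth noting: the sample URL does contain ':' characters (in ":open-layer" and "t:ac="), even though it has no scheme, which is presumably why no constructor-time check fires. A quick standard-library sketch (not Scrapy code) illustrates this:

```python
from urllib.parse import urlparse

invalid_url = ("/container.productlist.productslist.productthumbnail."
               "articledetaillink.layerlink:open-layer/0/CLASSIC/-1/"
               "WEB$007cARBO$007c13263065/null$007cDisplay$0020Product"
               "$002f111499$002fAil$0020blanc$007c?t:ac=13263065")

# The URL contains ':' characters despite having no scheme...
print(':' in invalid_url)            # True
# ...and urlparse correctly reports an empty scheme for it.
print(urlparse(invalid_url).scheme)  # ''
```

So any validation based only on the presence of ':' cannot catch this URL.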

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

1 reaction
pawelmhm commented, Feb 9, 2017

I think this check https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/__init__.py#L56

if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)

is too permissive. Every URL that contains a ':' anywhere will pass through, e.g.

In [4]: Request("aa:")
Out[4]: <GET aa:>
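A stricter check could parse the URL and validate the scheme rather than looking for a bare ':'. The sketch below is only an illustration of the idea, not Scrapy's actual implementation, and the ALLOWED_SCHEMES set is a made-up allowlist:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "ftp", "file", "data"}  # illustrative

def check_request_url(url):
    # Reject URLs whose scheme is missing or not something a downloader
    # handles. Note that "aa:" parses with scheme "aa", so a plain
    # "has a scheme" test would still let it through; an allowlist won't.
    scheme = urlparse(url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError("Missing or unsupported scheme in request url: %s" % url)

check_request_url("http://www.dmoz.org/")  # passes silently
```

With a check like this, both Request("aa:") and the scheme-less invalid_url from the issue would fail at construction time instead of disappearing silently.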
0 reactions
victor-torres commented, Oct 22, 2019

@hsumerf I believe the data: URIs are used to download embedded images and other types of files found on the page source code. For example, using the Media Pipeline (Files and Images).

I took the opportunity to open a PR proposing those changes and adding a simple regression test case.
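As a side note on the data: URIs mentioned above: they embed the payload directly in the URL, so "downloading" one is just decoding it, which is why a crawler needs to treat the data scheme as valid. A minimal sketch (the URI here is a made-up example, not from the issue):

```python
import base64

# Hypothetical data: URI carrying base64-encoded text.
uri = "data:text/plain;base64,aGVsbG8="

# Everything before the first comma describes the payload;
# everything after it is the payload itself.
header, _, payload = uri.partition(",")
assert header == "data:text/plain;base64"
print(base64.b64decode(payload))  # b'hello'
```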
