
scrapy.Request no init error on invalid url

See original GitHub issue

I stumbled on a weird issue: my spider received an invalid URL, but instead of failing loudly when creating a scrapy.Request() with that URL, Scrapy silently ignored the error. Sample to reproduce:

from scrapy.spiders import Spider
from scrapy import Request


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        invalid_url = "/container.productlist.productslist.productthumbnail.articledetaillink.layerlink:open-layer/0/CLASSIC/-1/WEB$007cARBO$007c13263065/null$007cDisplay$0020Product$002f111499$002fAil$0020blanc$007c?t:ac=13263065"
        yield Request(invalid_url)

This generates the following output:

2017-02-09 12:21:04 [scrapy.core.engine] INFO: Spider opened
2017-02-09 12:21:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-09 12:21:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-09 12:21:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2017-02-09 12:21:04 [scrapy.core.engine] INFO: Closing spider (finished)

There is no information about the attempt to generate this Request with invalid_url: no stack trace, no error, no message from any middleware. Why?
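One detail worth noting: the sample URL does contain ':' characters (in ":open-layer" and "t:ac="), even though it has no scheme, which is presumably why no constructor-time check fires. A quick standard-library sketch (not Scrapy code) illustrates this:

```python
from urllib.parse import urlparse

invalid_url = ("/container.productlist.productslist.productthumbnail."
               "articledetaillink.layerlink:open-layer/0/CLASSIC/-1/"
               "WEB$007cARBO$007c13263065/null$007cDisplay$0020Product"
               "$002f111499$002fAil$0020blanc$007c?t:ac=13263065")

# The URL contains ':' characters despite having no scheme...
print(':' in invalid_url)            # True
# ...and urlparse correctly reports an empty scheme for it.
print(urlparse(invalid_url).scheme)  # ''
```

So any validation based only on the presence of ':' cannot catch this URL.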

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

1 reaction
pawelmhm commented, Feb 9, 2017

I think this check https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/__init__.py#L56

if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)

is too permissive. Every URL that contains a ':' anywhere will pass through, e.g.

In [4]: Request("aa:")
Out[4]: <GET aa:>
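A stricter check could parse the URL and validate the scheme rather than looking for a bare ':'. The sketch below is only an illustration of the idea, not Scrapy's actual implementation, and the ALLOWED_SCHEMES set is a made-up allowlist:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "ftp", "file", "data"}  # illustrative

def check_request_url(url):
    # Reject URLs whose scheme is missing or not something a downloader
    # handles. Note that "aa:" parses with scheme "aa", so a plain
    # "has a scheme" test would still let it through; an allowlist won't.
    scheme = urlparse(url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError("Missing or unsupported scheme in request url: %s" % url)

check_request_url("http://www.dmoz.org/")  # passes silently
```

With a check like this, both Request("aa:") and the scheme-less invalid_url from the issue would fail at construction time instead of disappearing silently.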
0 reactions
victor-torres commented, Oct 22, 2019

@hsumerf I believe the data: URIs are used to download embedded images and other types of files found on the page source code. For example, using the Media Pipeline (Files and Images).

I took the opportunity to open a PR proposing those changes and adding a simple regression test case.
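As a side note on the data: URIs mentioned above: they embed the payload directly in the URL, so "downloading" one is just decoding it, which is why a crawler needs to treat the data scheme as valid. A minimal sketch (the URI here is a made-up example, not from the issue):

```python
import base64

# Hypothetical data: URI carrying base64-encoded text.
uri = "data:text/plain;base64,aGVsbG8="

# Everything before the first comma describes the payload;
# everything after it is the payload itself.
header, _, payload = uri.partition(",")
assert header == "data:text/plain;base64"
print(base64.b64decode(payload))  # b'hello'
```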
