
Documentation example fails with `proxy URL with no authority`

See original GitHub issue
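
For context, quotes.py is presumably the spider from the front page of the Scrapy documentation; a rough reconstruction is shown below (the exact selectors in the reporter's copy may differ, and the spider body is never reached, since the crash happens during engine startup).

```python
# quotes.py -- rough reconstruction of the Scrapy front-page example spider.
# The failure below occurs before any request is made, so this code is not
# actually involved in the error.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```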

Running the example from the documentation yields this:

10:11 $ scrapy runspider quotes.py 
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 0.9.1, parsel 1.5.0, w3lib 1.19.0, Twisted 16.0.0, Python 2.7.12 (default, Dec  4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016), cryptography 1.2.3, Platform Linux-4.4.0-130-generic-x86_64-with-Ubuntu-16.04-xenial
2018-07-11 10:12:04 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-07-11 10:12:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2018-07-11 10:12:04 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 88, in run
    self.crawler_process.crawl(spidercls, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 171, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 175, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 29, in from_crawler
    return cls(auth_encoding)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 22, in __init__
    self.proxies[type] = self._get_proxy(url, type)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 39, in _get_proxy
    proxy_type, user, password, hostport = _parse_proxy(url)
  File "/usr/lib/python2.7/urllib2.py", line 721, in _parse_proxy
    raise ValueError("proxy URL with no authority: %r" % proxy)
exceptions.ValueError: proxy URL with no authority: '/var/run/docker.sock'
2018-07-11 10:12:04 [twisted] CRITICAL:

Looks like the proxy code does not handle no_proxy correctly.
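
To make the mechanism concrete: Scrapy's HttpProxyMiddleware feeds every entry returned by the standard library's getproxies() through urllib2's _parse_proxy, and a no_proxy variable produces a 'no' entry just like any other *_proxy variable, so a socket path such as /var/run/docker.sock reaches the parser and is rejected for having no host part. A minimal sketch of that chain, assuming Python 2.7 as in the log above (this is illustrative, not Scrapy code itself):

```python
# Sketch of the failure mechanism. Assumes Python 2.7, matching the log above;
# on Python 3 both helpers live in urllib.request instead.
import os
from urllib import getproxies        # urllib.request.getproxies on Python 3
from urllib2 import _parse_proxy     # urllib.request._parse_proxy on Python 3

# Docker tooling is one common reason a socket path ends up in no_proxy:
os.environ['no_proxy'] = '/var/run/docker.sock'

# getproxies() maps each *_proxy variable to an entry, including 'no', and
# HttpProxyMiddleware.__init__ parses every one of them as a proxy URL.
for proxy_type, url in getproxies().items():
    print((proxy_type, _parse_proxy(url)))
# -> ValueError: proxy URL with no authority: '/var/run/docker.sock'
```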

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
a-palchikov commented, Sep 4, 2020

I guess NO_PROXY handling is very open to interpretation and is not standardized. The Docker client describes its use of NO_PROXY for its own purposes here, while Scrapy could simply ignore any proxy that urllib2._parse_proxy fails to parse.
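
A minimal sketch of that suggestion, assuming the skip would happen where HttpProxyMiddleware builds its proxy table; this is a simplified stand-in for scrapy/downloadermiddlewares/httpproxy.py, not the actual upstream patch:

```python
# Sketch of "ignore whatever _parse_proxy rejects"; simplified, not the real fix.
from urllib import getproxies        # urllib.request.getproxies on Python 3
from urllib2 import _parse_proxy     # urllib.request._parse_proxy on Python 3


class HttpProxyMiddleware(object):
    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for proxy_type, url in getproxies().items():
            try:
                self.proxies[proxy_type] = self._get_proxy(url, proxy_type)
            except ValueError:
                # e.g. no_proxy=/var/run/docker.sock is a socket path, not a
                # proxy URL, so skip it instead of aborting engine startup.
                continue

    def _get_proxy(self, url, orig_type):
        # Heavily simplified; the real method also extracts credentials.
        scheme, _user, _password, hostport = _parse_proxy(url)
        return None, '%s://%s' % (scheme or orig_type, hostport)
```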

0 reactions
drs-11 commented, Sep 4, 2020

I’m not sure what the solution to this issue could be. /var/run/docker.sock seems to be the only common case of a socket file showing up in a no_proxy environment variable. So either the socket file is ignored and never added to the list of proxies, or it is added without passing the socket file path to the _get_proxy method, which is what raises the error?

But the second option will cause further errors when the proxy is parsed in other modules. So I think ignoring the socket file would be the best option? Also, I can’t find any other cases where a socket file is used in no_proxy. Thoughts?
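
For comparison, here is a sketch of that first option: dropping anything that looks like a filesystem path before it ever reaches _get_proxy. The os.path.isabs() heuristic is my assumption about how "socket file" might be detected, not something from the issue or from Scrapy:

```python
# Sketch of "drop socket-file paths before parsing"; the isabs() heuristic is
# an assumption, not Scrapy behaviour.
import os
from urllib import getproxies        # urllib.request.getproxies on Python 3


def proxies_without_socket_paths():
    proxies = {}
    for proxy_type, url in getproxies().items():
        if os.path.isabs(url):
            # '/var/run/docker.sock' and similar entries are not proxy URLs.
            continue
        proxies[proxy_type] = url
    return proxies
```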

Read more comments on GitHub >

Top Results From Across the Web

How to Solve Proxy Error Codes - The Ultimate Guide!
A proxy error is an HTTP error status that you will receive as a response when a request sent to the web server...
Read more >
Proxy Error Meaning - http Status Codes - Bright Data
A 401 error code means you are not authorized to access the target site, and that is why the page will not load....
Read more >
JIRA GUI is not rendered properly when accessed via Proxy ...
JIRA GUI is not rendered properly when accessed via Proxy URL ... Failed to create requestUri // due to: Expected authority at index...
Read more >
HTTP 407 proxy authentication error when calling a web service
This works fine if the proxy server is turned off, or the URL has been whitelisted as not requiring authentication, but as soon...
Read more >
Troubleshooting Web Application Proxy | Microsoft Learn
The admin must make sure that no one binds to the same URLs. To check this, run the command: netsh http show urlacl....
Read more >
