
Documentation example fails with `proxy URL with no authority`

See original GitHub issue
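
For context, quotes.py is presumably the spider from the front page of the Scrapy documentation; a rough reconstruction is shown below (the exact selectors in the reporter's copy may differ, and the spider body is never reached, since the crash happens during engine startup).

```python
# quotes.py -- rough reconstruction of the Scrapy front-page example spider.
# The failure below occurs before any request is made, so this code is not
# actually involved in the error.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```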

Running the example from the documentation yields this:

10:11 $ scrapy runspider quotes.py 
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 0.9.1, parsel 1.5.0, w3lib 1.19.0, Twisted 16.0.0, Python 2.7.12 (default, Dec  4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016), cryptography 1.2.3, Platform Linux-4.4.0-130-generic-x86_64-with-Ubuntu-16.04-xenial
2018-07-11 10:12:04 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-07-11 10:12:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2018-07-11 10:12:04 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 88, in run
    self.crawler_process.crawl(spidercls, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 171, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 175, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 29, in from_crawler
    return cls(auth_encoding)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 22, in __init__
    self.proxies[type] = self._get_proxy(url, type)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 39, in _get_proxy
    proxy_type, user, password, hostport = _parse_proxy(url)
  File "/usr/lib/python2.7/urllib2.py", line 721, in _parse_proxy
    raise ValueError("proxy URL with no authority: %r" % proxy)
exceptions.ValueError: proxy URL with no authority: '/var/run/docker.sock'
2018-07-11 10:12:04 [twisted] CRITICAL:

Looks like the proxy code does not handle no_proxy correctly.
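
To make the mechanism concrete: Scrapy's HttpProxyMiddleware feeds every entry returned by the standard library's getproxies() through urllib2's _parse_proxy, and a no_proxy variable produces a 'no' entry just like any other *_proxy variable, so a socket path such as /var/run/docker.sock reaches the parser and is rejected for having no host part. A minimal sketch of that chain, assuming Python 2.7 as in the log above (this is illustrative, not Scrapy code itself):

```python
# Sketch of the failure mechanism. Assumes Python 2.7, matching the log above;
# on Python 3 both helpers live in urllib.request instead.
import os
from urllib import getproxies        # urllib.request.getproxies on Python 3
from urllib2 import _parse_proxy     # urllib.request._parse_proxy on Python 3

# Docker tooling is one common reason a socket path ends up in no_proxy:
os.environ['no_proxy'] = '/var/run/docker.sock'

# getproxies() maps each *_proxy variable to an entry, including 'no', and
# HttpProxyMiddleware.__init__ parses every one of them as a proxy URL.
for proxy_type, url in getproxies().items():
    print((proxy_type, _parse_proxy(url)))
# -> ValueError: proxy URL with no authority: '/var/run/docker.sock'
```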

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
a-palchikov commented, Sep 4, 2020

I guess NO_PROXY handling is very open to interpretation and is not standardized. The Docker client describes its use of NO_PROXY for its own purposes here, while Scrapy could simply ignore any proxy that urllib2._parse_proxy fails to parse.
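
A minimal sketch of that suggestion, assuming the skip would happen where HttpProxyMiddleware builds its proxy table; this is a simplified stand-in for scrapy/downloadermiddlewares/httpproxy.py, not the actual upstream patch:

```python
# Sketch of "ignore whatever _parse_proxy rejects"; simplified, not the real fix.
from urllib import getproxies        # urllib.request.getproxies on Python 3
from urllib2 import _parse_proxy     # urllib.request._parse_proxy on Python 3


class HttpProxyMiddleware(object):
    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for proxy_type, url in getproxies().items():
            try:
                self.proxies[proxy_type] = self._get_proxy(url, proxy_type)
            except ValueError:
                # e.g. no_proxy=/var/run/docker.sock is a socket path, not a
                # proxy URL, so skip it instead of aborting engine startup.
                continue

    def _get_proxy(self, url, orig_type):
        # Heavily simplified; the real method also extracts credentials.
        scheme, _user, _password, hostport = _parse_proxy(url)
        return None, '%s://%s' % (scheme or orig_type, hostport)
```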

0 reactions
drs-11 commented, Sep 4, 2020

I’m not sure what the solution to this issue could be. /var/run/docker.sock seems to be the only common case of a socket file showing up in a no_proxy environment variable. So either the socket file is ignored and never added to the list of proxies, or it is added without passing the socket file path to the _get_proxy method, which is what raises the error?

But the second option will cause further errors when the proxy is parsed in other modules. So I think ignoring the socket file would be the best option? Also, I can’t find any other cases where a socket file is used in no_proxy. Thoughts?
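
For comparison, here is a sketch of that first option: dropping anything that looks like a filesystem path before it ever reaches _get_proxy. The os.path.isabs() heuristic is my assumption about how "socket file" might be detected, not something from the issue or from Scrapy:

```python
# Sketch of "drop socket-file paths before parsing"; the isabs() heuristic is
# an assumption, not Scrapy behaviour.
import os
from urllib import getproxies        # urllib.request.getproxies on Python 3


def proxies_without_socket_paths():
    proxies = {}
    for proxy_type, url in getproxies().items():
        if os.path.isabs(url):
            # '/var/run/docker.sock' and similar entries are not proxy URLs.
            continue
        proxies[proxy_type] = url
    return proxies
```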

Read more comments on GitHub >

Top Results From Across the Web

How to Solve Proxy Error Codes - The Ultimate Guide!
A proxy error is an HTTP error status that you will receive as a response when a request sent to the web server...
Read more >
Proxy Error Meaning - http Status Codes - Bright Data
A 401 error code means you are not authorized to access the target site, and that is why the page will not load....
Read more >
JIRA GUI is not rendered properly when accessed via Proxy ...
JIRA GUI is not rendered properly when accessed via Proxy URL ... Failed to create requestUri // due to: Expected authority at index...
Read more >
HTTP 407 proxy authentication error when calling a web service
This works fine if the proxy server is turned off, or the URL has been whitelisted as not requiring authentication, but as soon...
Read more >
Troubleshooting Web Application Proxy | Microsoft Learn
The admin must make sure that no one binds to the same URLs. To check this, run the command: netsh http show urlacl....
Read more >
