Using a proxy over HTTP fails (HTTPS works)
Description
When I scrape without a proxy, both HTTPS and HTTP URLs work.
Using a proxy with HTTPS URLs also works just fine. My problem is when I try HTTP URLs through the proxy.
At that point I get the error twisted.web.error.SchemeNotSupported: Unsupported scheme: b''
As far as I can see, most people have this issue the other way around.
Steps to Reproduce
- Scrape an HTTP URL through a proxy
Expected behavior: Get a 200 response with the desired data.
Actual behavior:
ERROR: Error downloading <GET http://*********>
Traceback (most recent call last):
File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 76, in download_request
return handler.download_request(request, spider)
File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 82, in download_request
return agent.download_request(request)
File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 361, in download_request
d = agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 262, in request
endpoint=self._getEndpoint(self._proxyURI),
File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1729, in _getEndpoint
return self._endpointFactory.endpointForURI(uri)
File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1607, in endpointForURI
raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''
Reproduces how often: Every time I scrape with a proxy
Versions
Scrapy : 2.0.1
lxml : 4.4.1.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 20.3.0
Python : 3.7.3 (default, Apr 3 2019, 05:39:12) - [GCC 8.3.0]
pyOpenSSL : 19.0.0 (OpenSSL 1.1.1c 28 May 2019)
cryptography : 2.7
Platform : Linux-4.19.0-5-amd64-x86_64-with-debian-10.0
Additional context
I added some breakpoints to see where it breaks. I added the following lines in “twisted/web/client.py”, just before the point where the exception is raised:
endpoint = HostnameEndpoint(self._reactor, host, uri.port, **kwargs)
import logging
logger = logging.getLogger(__name__)
logger.error("%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
logger.error(uri)
logger.error(uri.host)
logger.error(uri.port)
logger.error(uri.scheme)
logger.error(dir(uri))
if uri.scheme == b'http':
    return endpoint
elif uri.scheme == b'https':
    connectionCreator = self._policyForHTTPS.creatorForNetloc(uri.host,
                                                              uri.port)
    return wrapClientTLS(connectionCreator, endpoint)
else:
    raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))
Apparently at this point the scheme is empty. If I run the same code with an HTTPS URL, this code is never reached, so it seems that even getting to this point is wrong and the proxy is not being used.
(edited to apply formatting)
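The empty scheme the logging above revealed can be illustrated outside Twisted with the standard-library URL parser (a rough analogue of Twisted's URI parsing, used here only for illustration; the proxy address is made up): a proxy written in network-location form without an explicit scheme parses with the scheme set to the empty string, which is exactly what endpointForURI rejects.

```python
from urllib.parse import urlsplit

# A proxy address given without an explicit scheme, as one might set it
# in an environment variable or a setting (hypothetical host and port).
parts = urlsplit("//proxy.example.com:8080")

print(repr(parts.scheme))  # -> '' (empty, like the b'' in the traceback)
print(parts.netloc)        # -> proxy.example.com:8080
```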
Issue Analytics
- Created 3 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
@Gallaecio I would like to contribute. I will take this as my first open-source contribution, but I may need some help from you.
I guess we can take this as an enhancement to support scheme-less HTTP proxy URLs.
I checked, and there is no bug: the logic that handles HTTP proxies is different from the one for HTTPS proxies, and the HTTPS one is implemented in a way that does not require a scheme in the proxy URL.
As a reference for people wishing to work on this, it should be as simple as modifying
ScrapyProxyAgent.request
so that the URL parameter passed to self._getEndpoint
is guaranteed to have http://
as its scheme. Parsing the URL, setting the scheme, and then unparsing it should do the job (https://docs.python.org/3/library/urllib.parse.html).