Using proxy through http fails (https works)

Description

When I scrape without a proxy, both https and http URLs work. Through the proxy, https URLs work just fine, but http URLs fail with the error twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

From what I can see, most people run into this the other way around.

Steps to Reproduce

Scrape an http URL through a proxy.

Expected behavior: Get a 200 with the desired data.

Actual behavior:

ERROR: Error downloading <GET http://*********>
Traceback (most recent call last):
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 76, in download_request
    return handler.download_request(request, spider)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 82, in download_request
    return agent.download_request(request)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 361, in download_request
    d = agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 262, in request
    endpoint=self._getEndpoint(self._proxyURI),
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1729, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1607, in endpointForURI
    raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

Reproduces how often: Every time I scrape an http URL through the proxy

Versions

Scrapy       : 2.0.1
lxml         : 4.4.1.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.7.3 (default, Apr  3 2019, 05:39:12) - [GCC 8.3.0]
pyOpenSSL    : 19.0.0 (OpenSSL 1.1.1c  28 May 2019)
cryptography : 2.7
Platform     : Linux-4.19.0-5-amd64-x86_64-with-debian-10.0

Additional context

To see where it fails, I added the logging calls below to endpointForURI in “twisted/web/client.py”, just before the point of failure:

        endpoint = HostnameEndpoint(self._reactor, host, uri.port, **kwargs)
        import logging
        logger = logging.getLogger(__name__)
        logger.error("%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
        logger.error(uri)
        logger.error(uri.host)
        logger.error(uri.port)
        logger.error(uri.scheme)
        logger.error(dir(uri))
        if uri.scheme == b'http':
            return endpoint
        elif uri.scheme == b'https':
            connectionCreator = self._policyForHTTPS.creatorForNetloc(uri.host,
                                                                      uri.port)
            return wrapClientTLS(connectionCreator, endpoint)
        else:
            raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))

Apparently the URI has no scheme at this point. If I run the same code with an https URL, this code is never reached. It seems that reaching this code path at all is the problem, and that the proxy is not being used.

(edited to apply formatting)

Issue Analytics

State:
Created 3 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

liveprasad commented, Apr 24, 2020

@Gallaecio I would like to contribute. I will take this on as my first open source contribution, but I may need some help from you.

Gallaecio commented, Apr 23, 2020

I guess we can take this as an enhancement to support scheme-less HTTP proxy URLs.

I checked, and there is no bug. The logic for handling HTTP and HTTPS proxies is different, and the HTTPS logic is implemented in a way that does not need the scheme in the proxy URL.

As a reference for people wishing to work on this, it should be as simple as modifying ScrapyProxyAgent.request so that the URL passed to self._getEndpoint is guaranteed to have http:// as its scheme. Parsing the URL, setting the scheme and then unparsing should do the job (https://docs.python.org/3/library/urllib.parse.html).
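
As a rough illustration of that idea, here is a minimal standard-library sketch that makes sure a proxy URL carries a scheme before it is used. The helper name ensure_http_scheme and the proxy address are hypothetical, not part of Scrapy or Twisted, and this is not the change that was made to ScrapyProxyAgent.request. Because urllib.parse can be ambiguous about a bare host:port string, the sketch checks for the "://" separator instead of parsing the scheme-less input directly.

```python
from urllib.parse import urlsplit


def ensure_http_scheme(proxy_url: str) -> str:
    """Return proxy_url with an explicit scheme, defaulting to http://.

    Hypothetical helper for illustration; not part of Scrapy or Twisted.
    """
    if "://" in proxy_url:
        # A scheme is already present (http, https, ...); leave it untouched.
        return proxy_url
    # Also covers the scheme-relative form "//host:port".
    return "http://" + proxy_url.lstrip("/")


fixed = ensure_http_scheme("127.0.0.1:8080")  # placeholder proxy address
parts = urlsplit(fixed)
print(fixed, parts.scheme, parts.hostname, parts.port)
```

Until scheme-less proxy URLs are supported, the same effect can likely be had on the user side by writing the proxy URL with an explicit http:// prefix, for example in request.meta['proxy'], assuming that is how the proxy is configured.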