Using proxy through http fails (https works)

Description

When I scrape without a proxy, both https and http URLs work. Through a proxy, https URLs work just fine, but http URLs fail with the error twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

From what I can see, most people have this issue the other way around.

Steps to Reproduce

  1. Scrape an http link through a proxy

Expected behavior: Get a 200 with the desired data.

Actual behavior:

ERROR: Error downloading <GET http://*********>
Traceback (most recent call last):
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 76, in download_request
    return handler.download_request(request, spider)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 82, in download_request
    return agent.download_request(request)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 361, in download_request
    d = agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 262, in request
    endpoint=self._getEndpoint(self._proxyURI),
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1729, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1607, in endpointForURI
    raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

Reproduces how often: Every time I scrape with proxy

Versions

Scrapy       : 2.0.1
lxml         : 4.4.1.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.7.3 (default, Apr  3 2019, 05:39:12) - [GCC 8.3.0]
pyOpenSSL    : 19.0.0 (OpenSSL 1.1.1c  28 May 2019)
cryptography : 2.7
Platform     : Linux-4.19.0-5-amd64-x86_64-with-debian-10.0

Additional context

I added some debug logging to see where it breaks. I added the following lines in “twisted/web/client.py”, just before the point of failure:

        endpoint = HostnameEndpoint(self._reactor, host, uri.port, **kwargs)
        import logging
        logger = logging.getLogger(__name__)
        logger.error("%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
        logger.error(uri)
        logger.error(uri.host)
        logger.error(uri.port)
        logger.error(uri.scheme)
        logger.error(dir(uri))
        if uri.scheme == b'http':
            return endpoint
        elif uri.scheme == b'https':
            connectionCreator = self._policyForHTTPS.creatorForNetloc(uri.host,
                                                                      uri.port)
            return wrapClientTLS(connectionCreator, endpoint)
        else:
            raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))

Apparently, at this point the URI has no scheme. If I run the same code with an https URL, this code is never reached. It seems that reaching this point at all is the problem, and that the proxy is not being used.

(edited to apply formatting)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

liveprasad commented, Apr 24, 2020 (1 reaction)

@Gallaecio I would like to contribute. I will take this on as my first open source contribution, but I may need some help from you.

Gallaecio commented, Apr 23, 2020 (1 reaction)

I guess we can take this as an enhancement to support scheme-less HTTP proxy URLs.

I checked, and there is no bug. The logic to handle HTTP and HTTPS proxies is different, and the HTTPS one is implemented in a way that does not need the scheme in the proxy URL.

As a reference for people wishing to work on this, it should be as simple as modifying ScrapyProxyAgent.request so that the URL parameter passed to self._getEndpoint is ensured to have http:// as its scheme. Parsing the URL, setting the scheme, and then unparsing should do the job (see https://docs.python.org/3/library/urllib.parse.html).
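
For illustration only, here is a rough sketch of the "parse, set the scheme, unparse" idea using urllib.parse. It is not the actual Scrapy patch: ensure_http_scheme is a hypothetical standalone helper that works on plain strings, whereas ScrapyProxyAgent.request works with Twisted's own URI objects. It also assumes the scheme-less proxy URL has the form host:port, optionally with user:pass@ in front.

```python
from urllib.parse import urlparse, urlunparse


def ensure_http_scheme(proxy_url: str) -> str:
    """Return proxy_url with an explicit http:// scheme if it has none.

    Hypothetical helper for illustration; not part of Scrapy or Twisted.
    """
    # A URL that already names a scheme (http://, https://, ...) is left alone.
    if "://" in proxy_url:
        return proxy_url
    # Prefixing "//" makes urlparse treat "host:port" as the network
    # location rather than as a path or a scheme.
    parsed = urlparse("//" + proxy_url.lstrip("/"))
    # Set the scheme on the parsed result, then rebuild the URL string.
    return urlunparse(parsed._replace(scheme="http"))


print(ensure_http_scheme("127.0.0.1:8080"))
print(ensure_http_scheme("user:secret@proxy.example.com:3128"))
print(ensure_http_scheme("https://proxy.example.com:3128"))
```

Until Scrapy handles this itself, the same effect can probably be had on the user side by writing the scheme explicitly in the proxy URL given to Scrapy (for example in request.meta['proxy']), e.g. http://127.0.0.1:8080 instead of 127.0.0.1:8080.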
