SSL errors crawling https sites using proxies
I’m unable to scrape HTTPS sites through proxies that support HTTPS. I’ve tried with proxymesh as well as other proxy services. I can scrape most of these sites without proxies or using Tor.
Curl seems to work fine too:
curl -x https://xx.xx.xx.xx:xx --proxy-user user:pass -L https://www.base.net:443
Retrieves the site’s html.
Setup:
- OS: OS X El Capitan v10.11.3
Scrapy:
scrapy version -v
Scrapy : 1.0.5
lxml : 3.5.0.0
libxml2 : 2.9.2
Twisted : 15.5.0
Python : 2.7.11 (default, Dec 7 2015, 23:36:10) - [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2g 1 Mar 2016)
Platform : Darwin-15.3.0-x86_64-i386-64bit
Solutions tried:
1 - Installing Scrapy-1.1.0rc3
2016-03-09 12:44:59 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
Other website:
2016-03-09 12:56:45 [scrapy] DEBUG: Retrying <GET https://es.alojadogatopreto.com/es-es/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]
2 - https://github.com/scrapy/scrapy/issues/1764#issuecomment-181950638
Using SSLv23_METHOD
2016-03-09 12:22:40 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]
Using other SSL methods
2016-03-09 12:24:11 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_RECORD', 'wrong version number')]>]
3 - https://github.com/scrapy/scrapy/issues/1227#issuecomment-154890557 | Get same errors as in 1 & 2.
4 - https://github.com/scrapy/scrapy/issues/1429#issuecomment-131187012 | Get same errors as in 1 & 2.
Top GitHub Comments
Thanks for answering @redapple.
The solution was changing base64.encodestring to base64.b64encode in my ProxyMiddleware. Did scrapy shell 'https://www.base.net' a few times and printed request.meta. The value for meta['proxy'] changes each time and corresponds to those in my proxy list.
You can alternatively use w3lib.http.basic_auth_header.