SSL errors crawling https sites using proxies

I’m unable to scrape HTTPS sites through proxies that support HTTPS. I’ve tried ProxyMesh as well as other proxy services. I can scrape most of these sites without proxies or through Tor.

Curl seems to work fine too:

curl -x https://xx.xx.xx.xx:xx --proxy-user user:pass -L https://www.base.net:443

This retrieves the site’s HTML.

Setup:

  • OS: OS X El Capitan v10.11.3

Scrapy:

scrapy version -v
Scrapy    : 1.0.5
lxml      : 3.5.0.0
libxml2   : 2.9.2
Twisted   : 15.5.0
Python    : 2.7.11 (default, Dec  7 2015, 23:36:10) - [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)]
pyOpenSSL : 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Darwin-15.3.0-x86_64-i386-64bit

Solutions tried:

1 - Installing Scrapy-1.1.0rc3

2016-03-09 12:44:59 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]

Other website:

2016-03-09 12:56:45 [scrapy] DEBUG: Retrying <GET https://es.alojadogatopreto.com/es-es/> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl23_read', 'ssl handshake failure')]>]

2 - https://github.com/scrapy/scrapy/issues/1764#issuecomment-181950638

Using SSLv23_METHOD:

2016-03-09 12:22:40 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL23_GET_SERVER_HELLO', 'unknown protocol')]>]

Using other SSL methods:

2016-03-09 12:24:11 [scrapy] ERROR: Error downloading <GET https://www.base.net/>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_RECORD', 'wrong version number')]>]

3 - https://github.com/scrapy/scrapy/issues/1227#issuecomment-154890557 | Get same errors as in 1 & 2.

4 - https://github.com/scrapy/scrapy/issues/1429#issuecomment-131187012 | Get same errors as in 1 & 2.

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

12 reactions
Cesped commented, Mar 9, 2016

Thanks for answering @redapple.

The solution was changing base64.encodestring to base64.b64encode in my ProxyMiddleware. I ran scrapy shell 'https://www.base.net' a few times and printed request.meta. The value of meta['proxy'] changes each time and corresponds to the entries in my proxy list.

0 reactions
Gallaecio commented, Oct 30, 2020

You can alternatively use w3lib.http.basic_auth_header
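
For readers hitting the same errors, here is a minimal sketch of the fix described in the comments above. It assumes the original ProxyMiddleware built a Proxy-Authorization header from the proxy credentials; the issue does not show that code, so the class name, function name, and constructor arguments below are hypothetical. The likely reason the change helps is that base64.encodestring (base64.encodebytes in Python 3) appends a newline to its output, which ends up inside the header value, while base64.b64encode returns the bare token. The sketch uses only the standard library rather than the w3lib helper mentioned above.

```python
import base64
import random


def proxy_auth_header(user, password):
    """Build a Proxy-Authorization value with no trailing newline."""
    creds = "{}:{}".format(user, password).encode("utf-8")
    # b64encode returns only the encoded token. encodestring/encodebytes
    # would append b"\n", which then leaks into the header value.
    return b"Basic " + base64.b64encode(creds)


class RandomProxyMiddleware(object):
    """Hypothetical downloader middleware; not the code from the issue."""

    def __init__(self, proxies, user, password):
        self.proxies = proxies
        self.auth = proxy_auth_header(user, password)

    def process_request(self, request, spider):
        # Pick one proxy per request and attach the credentials.
        request.meta["proxy"] = random.choice(self.proxies)
        request.headers["Proxy-Authorization"] = self.auth
```

In a real project the middleware would also need to be enabled through the DOWNLOADER_MIDDLEWARES setting, and the proxy list and credentials would come from the project's own configuration.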
