
HTTPS proxy doesn't work!

See original GitHub issue

I want to scrape http://www.lagou.com through a proxy. I collected a lot of proxy servers in the form {'https': 'https://110.72.7.236:8123'}; some of the servers support HTTP, some only HTTPS.

I have some code to check whether a server is OK:

import requests

proxies = [
    {'https': 'https://39.66.5.213:8998'},
    {'https': 'https://111.200.106.5:8123'},
    {'https': 'https://110.72.7.236:8123'},
    {'https': 'https://123.180.122.83:8998'}
]

checkUrl = 'http://www.lagou.com/'

def _check_proxy(proxy, timeout=None):
    try:
        r = requests.get(checkUrl, proxies=proxy, timeout=timeout)
        return r.status_code
    except requests.RequestException:
        return 0

def main():
    for proxy in proxies:
        print proxy, _check_proxy(proxy, timeout=20)

main()

From this I think the proxy servers are healthy.
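Note that requests selects which proxy to use by matching the target URL's scheme against the keys of the proxies dict, so checking an http:// URL against an entry keyed 'https' sends the request directly, without the proxy. A minimal sketch of that selection rule (pick_proxy is illustrative, not the requests API):

```python
from urllib.parse import urlparse

def pick_proxy(url, proxies):
    # requests keys its proxies dict by the target URL's scheme: an
    # 'https' entry is only consulted for https:// URLs.
    return proxies.get(urlparse(url).scheme)

proxy = {'https': 'https://110.72.7.236:8123'}
pick_proxy('http://www.lagou.com/', proxy)   # None: the request goes direct
pick_proxy('https://www.lagou.com/', proxy)  # the proxy is used
```

So a 200 from this check may only prove the site is reachable without any proxy, not that the proxy itself works.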

But when I plug the proxies into Scrapy, it doesn't work. My proxy middleware code:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

import pdb
import random
import base64
import time
import datetime
from codec import convert
from scrapy import signals
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class ProxyMiddleware(object):

    def __init__(self, settings):
        self.proxies = settings.get('PROXIES', [])

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        pass

    def spider_opened(self, spider):
        self.PROXIES = [
            {'username': '', 'ip': '119.29.120.97', 'password': '', 'type': 'https', 'port': '80'},
            {'username': '', 'ip': '115.46.68.190', 'password': '', 'type': 'https', 'port': '8123'},
            {'username': '', 'ip': '60.163.3.9', 'password': '', 'type': 'https', 'port': '8118'},
            {'username': '', 'ip': '171.37.134.201', 'password': '', 'type': 'https', 'port': '8123'},
        ]


    def process_request(self, request, spider):
        # Don't overwrite with a random one (server-side state for IP)
        #  if 'proxy' in request.meta:
            #  return
        proxy = random.choice(self.PROXIES)
        proxy_user_pass = proxy['username'] +':'+ proxy['password'] if proxy['username'] else ''
        print proxy_user_pass
        request.meta['proxy'] = "%s://%s:%s" % (proxy['type'], proxy['ip'], proxy['port'])
        #  request.meta['proxy'] = "http://%s:%s" % (proxy['ip'], proxy['port'])
        #  request.meta['proxy'] = 'https://180.76.163.61:10000'
        print request.meta
        if proxy_user_pass:
            # encodestring() appends a trailing newline, which would corrupt
            # the header value, so strip it.
            basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass).strip()
            request.headers['Proxy-Authorization'] = basic_auth

    def process_exception(self, request, exception, spider):
        proxy = request.meta['proxy']
        spider.logger.error('%s! Removing failed proxy <%s>, %d proxies left' % (
            exception, proxy, len(self.PROXIES)))
        # request.meta['proxy'] holds a URL string while self.PROXIES holds
        # dicts, so drop entries by comparing their formatted URL.
        self.PROXIES = [p for p in self.PROXIES
                        if '%s://%s:%s' % (p['type'], p['ip'], p['port']) != proxy]

Some of the errors look like this:

2016-07-21 16:50:10 [scrapy] DEBUG: Retrying <POST http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2016-07-21 16:50:10 [scrapy] DEBUG: Retrying <POST http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false> (failed 2 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.].
2016-07-21 16:50:10 [scrapy] DEBUG: Gave up retrying <POST http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false> (failed 3 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.].
2016-07-21 16:50:10 [lagou_new] ERROR: An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.].! Removing failed proxy https://1.59.94.112:8998, 0 proxies left
2016-07-21 16:50:10 [scrapy] DEBUG: Retrying <POST http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false> (failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.].

I did some tests:

{'https': 'https://180.76.163.61:10000'} also supports HTTP. When I use it:

request.meta['proxy'] = 'https://180.76.163.61:10000'

it works! The other servers, which cannot support HTTP, do not work.
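This matches how Scrapy interprets request.meta['proxy']: the URL scheme there describes how to connect to the proxy itself, and most free-list "https" proxies are plain-HTTP proxies that only tunnel TLS via CONNECT. A minimal sketch under that assumption (as_meta_proxy is an illustrative helper, not part of the middleware above):

```python
def as_meta_proxy(entry):
    # The scheme of request.meta['proxy'] is the protocol used to reach the
    # proxy itself; free-list "https" proxies usually only speak plain HTTP,
    # so connect over http:// regardless of the advertised type.
    return 'http://%s:%s' % (entry['ip'], entry['port'])

entry = {'username': '', 'ip': '110.72.7.236', 'password': '',
         'type': 'https', 'port': '8123'}
meta_proxy = as_meta_proxy(entry)  # 'http://110.72.7.236:8123'
```

In the middleware, request.meta['proxy'] = as_meta_proxy(proxy) would replace the line that builds the URL from proxy['type'].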

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
BuGoNee commented, Jul 21, 2016

Thanks for your help. I think the issue can be closed. I'll give virtualenv a try.

1 reaction
redapple commented, Jul 21, 2016

@BuGoNee, what version of Scrapy are you using? (Check with scrapy version -v.) The recently released Scrapy 1.1.1 has a fix for HTTPS proxies related to a missing Host: header. See https://github.com/scrapy/scrapy/pull/2069
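Whether a given install already carries that fix comes down to a version comparison; a pure-Python sketch (version_tuple is illustrative and handles plain numeric versions only, not Scrapy's own check):

```python
def version_tuple(v):
    # '1.1.0' -> (1, 1, 0); plain numeric versions only.
    return tuple(int(x) for x in v.split('.'))

has_fix = version_tuple('1.1.0') >= version_tuple('1.1.1')  # False: upgrade needed
```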
