
Scrapy keeps using the wrong proxy

See original GitHub issue

Hi! I have an issue when trying to test proxies with Scrapy. I want to check proxies against httpbin.org, so I made this crawler:

import base64
from datetime import datetime

import scrapy


class CheckerSpider(scrapy.Spider):
    name = "checker"
    # Note the trailing comma: ('...') without it is just a string, not a tuple
    start_urls = (
        'https://www.httpbin.org/ip',
    )
    connection = get_connection()

    def start_requests(self):

        with self.connection.cursor() as cursor:
            # Re-check proxies that are unchecked or were last checked over an hour ago
            limit = int((datetime.now() - datetime(1970, 1, 1)).total_seconds()) - 3600
            q = """ SELECT *
                    FROM {}
                    WHERE active = 1 AND (last_checked <= {} OR last_checked IS NULL);""".format(DB_TABLE, limit)
            cursor.execute(q)
            proxy_list = cursor.fetchall()

        for proxy in proxy_list[:15]:
            req = scrapy.Request(self.start_urls[0], self.check_proxy, dont_filter=True)
            req.meta['proxy'] = 'https://{}:8080'.format(proxy['ip'])
            req.meta['item'] = proxy
            # b64encode, unlike the deprecated encodestring, adds no trailing newline
            user_pass = base64.b64encode('{}:{}'.format(PROXY_USER, PROXY_PASSWORD).encode()).decode()
            req.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
            req.headers['User-Agent'] = get_user_agent()
            yield req

    def check_proxy(self, response):
        print(response.request.meta['proxy'])
        print(response.meta['item']['ip'])
        print(response.body)
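A side note on the Proxy-Authorization header built in the spider: `base64.encodestring` (Python 2; `encodebytes` in Python 3) appends a trailing newline that ends up inside the header value, while `b64encode` does not. A minimal sketch with hypothetical credentials:

```python
import base64

creds = b"user:secret"  # hypothetical credentials, for illustration only

with_newline = base64.encodebytes(creds)  # encodestring's Python 3 successor
clean = base64.b64encode(creds)

print(with_newline)  # b'dXNlcjpzZWNyZXQ=\n'  <- stray newline
print(clean)         # b'dXNlcjpzZWNyZXQ='

# Safe header value:
header = 'Basic {}'.format(clean.decode())
print(header)  # Basic dXNlcjpzZWNyZXQ=
```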

But when I test it, I see that Scrapy connects through only 5 of the proxies and then stops switching. Example output (IPs masked):

2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.130:8080
192.168.100.130
{
  "origin": "192.168.100.130"
}

2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.131:8080
192.168.100.131
{
  "origin": "192.168.100.131"
}
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.132:8080
192.168.100.132
{
  "origin": "192.168.100.132"
}

# Here Scrapy used wrong proxy to connect to site.
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.134:8080
192.168.100.134
{
  "origin": "192.168.100.130"
}

Maybe I've done something wrong? Any ideas? Thank you.

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 3
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

7 reactions
rverbitsky commented, Mar 21, 2016

Finally I've found the reason. My initial assumption that Scrapy caches connections was right. Subclassing the original HTTPDownloadHandler and making its HTTPConnectionPool non-persistent solves the problem.

from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from twisted.internet import reactor
from twisted.web.client import HTTPConnectionPool


class RotatingProxiesDownloadHandler(HTTPDownloadHandler):

    def __init__(self, settings):
        super(RotatingProxiesDownloadHandler, self).__init__(settings)
        # Replace the default persistent pool so connections are not reused
        self._pool = HTTPConnectionPool(reactor, persistent=False)

And in settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'myproject.handlers.RotatingProxiesDownloadHandler',
    'https': 'myproject.handlers.RotatingProxiesDownloadHandler',
}

This should work for both http and https 😃 Hope this helps. P.S. It would be nice to have this available as a stock Scrapy setting.
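The behaviour in the original report is consistent with a pool that keys cached connections by the target's scheme/host/port rather than by the proxy that opened them (an assumption about the internals, sketched below with a toy pool, not Twisted's actual HTTPConnectionPool). Every request hits the same www.httpbin.org key, so a persistent pool hands back whichever connection, and therefore whichever proxy, came first; with `persistent=False` each request opens a fresh connection:

```python
# Toy sketch of connection reuse keyed by target, not by proxy.
class ToyConnectionPool:
    def __init__(self, persistent=True):
        self.persistent = persistent
        self._cached = {}  # key -> proxy that opened the connection

    def get_connection(self, key, proxy):
        if self.persistent and key in self._cached:
            return self._cached[key]  # reused: the requested proxy is ignored
        if self.persistent:
            self._cached[key] = proxy
        return proxy


key = ('https', 'www.httpbin.org', 443)  # same key for every request

pool = ToyConnectionPool(persistent=True)
print(pool.get_connection(key, 'proxy-A'))  # proxy-A
print(pool.get_connection(key, 'proxy-B'))  # proxy-A again: wrong proxy

pool = ToyConnectionPool(persistent=False)
print(pool.get_connection(key, 'proxy-A'))  # proxy-A
print(pool.get_connection(key, 'proxy-B'))  # proxy-B: rotation works
```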

0 reactions
redapple commented, May 3, 2016

@DrJackilD , thanks for the feedback! Glad this is fixed now 😃


