Scrapy keeps wrong proxy setting
See original GitHub issue

Hi! I have an issue when trying to test proxies with Scrapy. I want to check proxies against httpbin.org, so I made this crawler:
import base64
from datetime import datetime

import scrapy


class CheckerSpider(scrapy.Spider):
    name = "checker"
    start_urls = (
        'https://www.httpbin.org/ip',
    )
    connection = get_connection()

    def start_requests(self):
        with self.connection.cursor() as cursor:
            # Epoch seconds, one hour ago
            limit = int((datetime.now() - datetime(1970, 1, 1)).total_seconds()) - 3600
            q = """SELECT *
                   FROM {}
                   WHERE active = 1 AND last_checked <= {} OR last_checked IS NULL;""".format(DB_TABLE, limit)
            cursor.execute(q)
            proxy_list = cursor.fetchall()
        for proxy in proxy_list[:15]:
            word = get_random_word()
            req = scrapy.Request(self.start_urls[0], self.check_proxy, dont_filter=True)
            req.meta['proxy'] = 'https://{}:8080'.format(proxy['ip'])
            req.meta['item'] = proxy
            user_pass = base64.encodestring('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            req.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
            req.headers['User-Agent'] = get_user_agent()
            yield req

    def check_proxy(self, response):
        print response.request.meta['proxy']
        print response.meta['item']['ip']
        print response.body
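One side note on the snippet above: `base64.encodestring` appends a trailing newline, which can corrupt the `Proxy-Authorization` header value. A minimal, self-contained sketch of building the header with `base64.b64encode` instead (header value only, no Scrapy involved; the function name is illustrative):

```python
import base64


def basic_proxy_auth(user, password):
    # b64encode, unlike the deprecated encodestring/encodebytes,
    # adds no trailing newline that would corrupt the header value.
    token = base64.b64encode('{}:{}'.format(user, password).encode('ascii'))
    return 'Basic {}'.format(token.decode('ascii'))


print(basic_proxy_auth('user', 'pass'))  # Basic dXNlcjpwYXNz
```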
But when testing it, I saw that Scrapy connected to the URL through only 5 proxies and then stopped rotating them. Example output (IPs obfuscated):
2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.130:8080
192.168.100.130
{
"origin": "192.168.100.130"
}
2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.131:8080
192.168.100.131
{
"origin": "192.168.100.131"
}
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.132:8080
192.168.100.132
{
"origin": "192.168.100.132"
}
# Here Scrapy used the wrong proxy to connect to the site.
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.134:8080
192.168.100.134
{
"origin": "192.168.100.130"
}
Maybe I did something wrong? Any ideas? Thank you.
Issue Analytics
- Created 8 years ago
- Reactions: 3
- Comments: 9 (4 by maintainers)
Top GitHub Comments
Finally I've found the reason. My initial assumption was right: Scrapy caches (pools) connections. Subclassing the original HTTP download handler, setting its `HTTPConnectionPool` persistence to False, and registering the subclass under `DOWNLOAD_HANDLERS` in settings.py solves the problem.
Should work for both http and https 😃 Hope this helps. P.S. It would be nice to have this in the default settings.
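The code blocks from this comment were lost in extraction. A minimal sketch of the described approach, assuming Scrapy 1.x's `HTTP11DownloadHandler` and Twisted's `HTTPConnectionPool`; the module path `myproject.handlers` and the class name are illustrative, not from the original comment:

```python
# myproject/handlers.py  (hypothetical module path)
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler
from twisted.internet import reactor
from twisted.web.client import HTTPConnectionPool


class NonPersistentDownloadHandler(HTTP11DownloadHandler):
    """Download handler whose connection pool never reuses connections,
    so each request is sent through the proxy set in its own meta."""

    def __init__(self, settings):
        super(NonPersistentDownloadHandler, self).__init__(settings)
        # Replace the default persistent pool with a non-persistent one.
        self._pool = HTTPConnectionPool(reactor, persistent=False)
```

And in settings.py:

```python
DOWNLOAD_HANDLERS = {
    'http': 'myproject.handlers.NonPersistentDownloadHandler',
    'https': 'myproject.handlers.NonPersistentDownloadHandler',
}
```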
@DrJackilD, thanks for the feedback! Glad this is fixed now 😃