Scrapy seems to fail to load some sites when using proxy or user agent middleware?
Description
I am trying to use HTTP proxies with Scrapy to reduce the interval between crawls, and I seem to be having issues no matter which middleware I use.
The most recent proxy middleware I have tried (it seems to be the most up to date) is: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
The above works fine for most sites, but something problematic is occurring for some repeat offenders. I have noticed that these same websites have the same kind of connection issues when trying to use middlewares for user agent switches.
As soon as I remove all of these middlewares (proxy and user agent), the issues with these sites go away. I cannot access them with either one or both enabled.
I am raising this issue here rather than on the individual middlewares' GitHub repositories because I am experiencing this across the board, so I am not sure whether the cause is something under the hood in Scrapy.
A recent example of this is as follows:
https://very.co.uk
https://www.very.co.uk
https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100
I can successfully hit very.co.uk with scrapy fetch (passing my user agent). However, as soon as I get the 301 redirect, something goes wrong and the connection to the redirected URL fails. I cannot successfully request https://www.very.co.uk during a crawl or a fetch when using a proxy.
At first I suspected an issue with the proxies I'm using (i.e. access denied because they are blocked), so I tried to access both pages with curl through the same proxy. I successfully received the 301 response and then an HTTP 200 (with response data) for the second URL, which is the one I'm unable to access with Scrapy.
curl --proxy http://myuser:mypass@myproxyip:80 -v -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "https://www.very.co.uk"
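The curl check above can also be repeated outside Scrapy from Python. The following is a minimal sketch using only the standard library, assuming the same placeholder proxy URL and user agent as in the curl command; the function names `build_proxy_opener` and `fetch_via_proxy` are hypothetical and not part of Scrapy or any middleware.

```python
import urllib.request

# Placeholder proxy URL taken from the curl command above; replace with real values.
PROXY_URL = "http://myuser:mypass@myproxyip:80"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
)


def build_proxy_opener(proxy_url):
    """Return an opener that routes both http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url=PROXY_URL, user_agent=USER_AGENT, timeout=30):
    """Fetch url through the proxy and return (final_status, final_url).

    urllib follows redirects automatically, so a 301 to the www host
    shows up as a different final URL rather than as a 301 status.
    """
    opener = build_proxy_opener(proxy_url)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(request, timeout=timeout) as response:
        return response.status, response.geturl()


# Example call (performs a real network request through the proxy):
# print(fetch_via_proxy("https://very.co.uk"))
```

If this succeeds through the same proxy while Scrapy fails, that points away from the proxy itself being blocked, which matches what curl showed.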
Steps to Reproduce
scrapy fetch "https://very.co.uk" -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
(with the proxy middleware installed / enabled)
Expected behavior: 301 --> https://www.very.co.uk --> HTTP 200
Actual behavior: 301 --> https://www.very.co.uk --> Dead / timeouts / connection failures
Reproduces how often: 100%
Versions
Scrapy : 2.4.1
lxml : 4.6.1.0
libxml2 : 2.9.5
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020)
cryptography : 3.2.1
Platform : Windows-10-10.0.19041-SP0
Issue Analytics
- Created 3 years ago
- Comments:14 (5 by maintainers)
I may have a lead: it seems to work when the proxy scheme is https, but does not work when it's http.
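To compare the two proxy schemes side by side, a small helper can produce both variants of one proxy URL. This is only a sketch: `with_proxy_scheme` and `proxy_variants` are hypothetical names, the proxy URL is the placeholder from the curl command, and the port is left unchanged on the assumption that the proxy accepts both schemes on it.

```python
from urllib.parse import urlsplit, urlunsplit


def with_proxy_scheme(proxy_url, scheme):
    """Return proxy_url with its scheme replaced; credentials, host and port are kept."""
    parts = urlsplit(proxy_url)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


def proxy_variants(proxy_url):
    """Return the http and https variants of the same proxy URL, keyed by scheme."""
    return {scheme: with_proxy_scheme(proxy_url, scheme) for scheme in ("http", "https")}


# Placeholder proxy URL; replace with real values before use.
variants = proxy_variants("http://myuser:mypass@myproxyip:80")
```

Each variant can then be tried in turn by setting it as `meta={"proxy": ...}` on a `scrapy.Request`, which is the key Scrapy's built-in `HttpProxyMiddleware` reads, to see which scheme reproduces the failure.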