
Scrapy seems to fail to load some sites when using proxy or user agent middleware?


Description

I am trying to use HTTP proxies with Scrapy to reduce the time between crawls, but I seem to have issues no matter which middleware I use.

The more recent proxy middleware I have tried is as follows (seems most up-to-date): https://github.com/TeamHG-Memex/scrapy-rotating-proxies
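For context, a minimal settings.py sketch for enabling that middleware could look like the following. The middleware paths and priorities follow the scrapy-rotating-proxies README; the proxy entry is a placeholder borrowed from the curl command in this issue, not a real endpoint, and the original report does not show the settings that were actually used.

```python
# settings.py (sketch): enable scrapy-rotating-proxies.
# Assumption: the proxy below is a placeholder, not a working endpoint.
ROTATING_PROXY_LIST = [
    "http://myuser:mypass@myproxyip:80",
]

# Middleware paths and priorities as given in the scrapy-rotating-proxies README.
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```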

The above works fine for most sites, but a few sites fail repeatedly. I have noticed that these same websites show the same kind of connection issues when I use middlewares that switch the user agent.

As soon as I remove all of these middlewares (proxy / user agent), the issues with these sites go away. With either one enabled, or both, I cannot access them.

I am raising this issue here rather than on each middleware's GitHub page because I am seeing it across the board, so I am not sure whether the cause is something under the hood.

A recent example of this is as follows:

https://very.co.uk
https://www.very.co.uk
https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100

I can successfully hit very.co.uk with scrapy fetch (passing my user agent). However, as soon as I get the 301 redirect, something goes wrong and the connection to the redirected URL fails. I cannot successfully fetch or request https://www.very.co.uk during a crawl or a fetch when using a proxy.

At first I suspected an issue with the proxies that I'm using (i.e. access denied due to being blocked), so I tried to access both pages with curl through the same proxy. I received the 301 response and then an HTTP 200 (with response data) for the second URL, which is the one I cannot access with Scrapy.

curl --proxy http://myuser:mypass@myproxyip:80 -v -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "https://www.very.co.uk"
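The same check can be repeated from Python without Scrapy, which may help separate a proxy problem from a Scrapy problem. This is a minimal sketch using only the standard library; the proxy URL is the placeholder from the curl command, and the helper names build_proxy_opener and fetch_status are hypothetical, introduced here for illustration.

```python
import urllib.request

# Placeholders copied from the curl command above; replace with real values.
PROXY_URL = "http://myuser:mypass@myproxyip:80"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
)


def build_proxy_opener(proxy_url, user_agent):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    # Replace urllib's default User-Agent with the browser-like one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch_status(opener, url, timeout=30):
    """Follow redirects and return (final_url, status_code)."""
    with opener.open(url, timeout=timeout) as response:
        return response.geturl(), response.status
```

Calling fetch_status(build_proxy_opener(PROXY_URL, USER_AGENT), "https://very.co.uk") with a working proxy should follow the redirect and report the final URL and status, mirroring what curl showed.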

Steps to Reproduce

scrapy fetch "https://very.co.uk" -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" (with the proxy middleware installed / enabled)

Expected behavior: 301 --> https://www.very.co.uk --> HTTP 200

Actual behavior: 301 --> https://www.very.co.uk --> Dead / timeouts / connection failures

Reproduces how often: 100%

Versions

Scrapy : 2.4.1
lxml : 4.6.1.0
libxml2 : 2.9.5
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020)
cryptography : 3.2.1
Platform : Windows-10-10.0.19041-SP0

Issue Analytics

Created 3 years ago
Comments: 14 (5 by maintainers)

Top GitHub Comments

mmotti commented, Dec 2, 2022 (1 reaction)

Have you had any luck resolving this? I wasn't able to fix this. I haven't done any scraping for a while now, though.

Elias-SLH commented, Dec 8, 2022 (0 reactions)

I may have a lead: it seems to work when the proxy scheme is https but does not work when it’s http.
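One way to test this lead is to issue the same request once per proxy scheme and compare the results. The sketch below only builds the two proxy URLs; it relies on the fact that Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"]. The helper name proxy_meta_variants is hypothetical, the proxy URL is the placeholder from the issue, and whether a given proxy actually accepts the https scheme is an assumption to verify.

```python
from urllib.parse import urlsplit


def proxy_meta_variants(proxy_url):
    """Return a Scrapy-style meta dict per scheme for the same proxy endpoint."""
    parts = urlsplit(proxy_url)
    return {
        scheme: {"proxy": parts._replace(scheme=scheme).geturl()}
        for scheme in ("http", "https")
    }


# Placeholder proxy from the issue; credentials, host and port are kept as-is.
variants = proxy_meta_variants("http://myuser:mypass@myproxyip:80")
for scheme, meta in variants.items():
    # In a spider, each dict could be passed as scrapy.Request(url, meta=meta).
    print(scheme, meta["proxy"])
```

If only the https variant succeeds against the failing sites, that would support the observation in this comment; it would not by itself explain why the http variant fails.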
