Scrapy seems to fail to load some sites when using proxy or user agent middleware?

Description

I am trying to use some HTTP proxies with Scrapy in order to reduce the time between crawls, and I seem to be having issues no matter which middleware I use.

The most recent proxy middleware I have tried (it seems to be the most up to date) is: https://github.com/TeamHG-Memex/scrapy-rotating-proxies

The above works fine for most sites, but something problematic is occurring for some repeat offenders. I have noticed that these same websites have the same kind of connection issues when trying to use middlewares for user agent switches.

As soon as I remove all of these middlewares (proxy / user agent), the issues with these sites go away. I cannot access them with either middleware, or both, enabled.

I am raising this issue here rather than on the individual middleware's GitHub because I seem to be experiencing it across the board, so I am not sure whether this is something under the hood.
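For context, a settings.py that enables scrapy-rotating-proxies typically looks roughly like the sketch below, following that project's README. The report does not include the actual settings used, so the proxy entries here are placeholders and the whole fragment is an assumption rather than the reporter's configuration.

```python
# settings.py (sketch): placeholder proxies, not the reporter's actual list.
ROTATING_PROXY_LIST = [
    "myproxyip:80",         # placeholder host:port taken from the curl example
    "myotherproxyip:8080",  # hypothetical second proxy
]

# Middleware paths and priorities as documented in the scrapy-rotating-proxies README.
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

Removing the two DOWNLOADER_MIDDLEWARES entries corresponds to the "middlewares removed" case described above.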

A recent example of this is as follows:

https://very.co.uk
https://www.very.co.uk
https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100

I can successfully hit very.co.uk with scrapy fetch (passing my user agent); however, as soon as I get the 301 redirect, the connection to the redirected URL fails. I cannot successfully fetch/request https://www.very.co.uk during a crawl or a fetch when using a proxy.

At first I suspected an issue with the proxies that I'm using (i.e. access denied due to being blocked), so I tried to access both pages with curl. I successfully received the 301 response and a subsequent HTTP 200 (with response data) for the second URL, which I'm unable to access with Scrapy when using the same proxy.

curl --proxy http://myuser:mypass@myproxyip:80 -v -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "https://www.very.co.uk"
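The same check can be sketched with Python's standard library, which gives a second non-Scrapy data point alongside curl. This is only an illustration: `build_proxy_opener` and `fetch_status` are hypothetical helper names, and the proxy URL is the placeholder from the curl command.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
)


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    # Send the same browser-like User-Agent as the curl command.
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def fetch_status(opener: urllib.request.OpenerDirector, url: str, timeout: float = 30.0):
    """Fetch url, following redirects, and return (final_url, status_code)."""
    with opener.open(url, timeout=timeout) as response:
        return response.geturl(), response.status
```

Calling `fetch_status(build_proxy_opener("http://myuser:mypass@myproxyip:80"), "https://very.co.uk")` with real proxy credentials should report the final URL and status after the redirect, if the proxy behaves as it does under curl.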

Steps to Reproduce

  1. scrapy fetch "https://very.co.uk" -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" (with the proxy middleware installed / enabled)

Expected behavior: 301 --> https://www.very.co.uk --> HTTP 200

Actual behavior: 301 --> https://www.very.co.uk --> Dead / timeouts / connection failures

Reproduces how often: 100%

Versions

Scrapy       : 2.4.1
lxml         : 4.6.1.0
libxml2      : 2.9.5
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020)
cryptography : 3.2.1
Platform     : Windows-10-10.0.19041-SP0

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

mmotti commented, Dec 2, 2022 (1 reaction)

Have you had any luck resolving this? I wasn't able to fix this. Haven't done any scraping for a while now though.

Elias-SLH commented, Dec 8, 2022 (0 reactions)

I may have a lead: it seems to work when the proxy scheme is https but does not work when it’s http.
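One way to try this lead is to rewrite the proxy URL's scheme before handing it to Scrapy. The helper below is a minimal sketch; `with_proxy_scheme` is a hypothetical name, the proxy URL is the placeholder from the curl command above, and whether a given proxy accepts the https scheme on that port is an assumption to verify.

```python
from urllib.parse import urlsplit


def with_proxy_scheme(proxy_url: str, scheme: str) -> str:
    """Return proxy_url with its scheme replaced, keeping credentials, host and port."""
    return urlsplit(proxy_url)._replace(scheme=scheme).geturl()


# Placeholder proxy URL; substitute real credentials and host.
proxy = with_proxy_scheme("http://myuser:mypass@myproxyip:80", "https")

# In a spider, the value would then be set per request through the "proxy"
# meta key that Scrapy's HttpProxyMiddleware reads, for example:
#   scrapy.Request(url, meta={"proxy": proxy})
```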
