CheerioCrawler reuses proxy connections, breaking upstream IP rotation

Describe the bug:

In Apify v0.2.*, every proxy request made by requestAsBrowser() used a distinct connection (https://github.com/apify/http-request/blob/d4a12a856acebfebbf22a5cf57593c81a4be6462/src/index.js#L136). With Apify v2.* and the move to got-scraping (and possibly v1.*, though we skipped that version), proxy agent connections are now globally cached. As a result, multiple proxied requests to the same origin travel down the same encrypted CONNECT tunnel, which breaks upstream IP rotation done at the proxy layer (as is common with most datacentre offerings from providers like Bright Data or Proxy Rack). Essentially, requests that should exit the upstream proxy provider with different IPs all exit with the same IP, which makes them easy to ban.
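The effect of a shared agent can be reproduced locally without any proxy. The sketch below is an analogy, not got-scraping's code: it uses only Node's built-in http module, and `get`, `countConnections`, `pickAgent` and `demo` are hypothetical names. A local server counts incoming TCP connections while five sequential requests are sent first through one cached keep-alive agent and then through a fresh agent per request.

```javascript
const http = require('http');

// Issue one GET and resolve once the response body has been fully consumed.
function get(port, agent) {
  return new Promise((resolve, reject) => {
    http
      .get({ host: '127.0.0.1', port, agent }, (res) => {
        res.resume();
        res.on('end', resolve);
      })
      .on('error', reject);
  });
}

// Count the TCP connections a local server sees for `n` sequential requests.
// `pickAgent` (hypothetical) returns the agent to use for each request.
async function countConnections(n, pickAgent) {
  let connections = 0;
  const server = http.createServer((req, res) => res.end('ok'));
  server.on('connection', () => { connections += 1; });
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  const { port } = server.address();

  const agents = new Set();
  for (let i = 0; i < n; i += 1) {
    const agent = pickAgent();
    agents.add(agent);
    await get(port, agent);
    // Yield once so a keep-alive agent can return the socket to its free pool.
    await new Promise((resolve) => setImmediate(resolve));
  }

  // Close any idle sockets so the server can shut down.
  for (const agent of agents) agent.destroy();
  await new Promise((resolve) => server.close(resolve));
  return connections;
}

async function demo() {
  // One agent shared by every request, similar in spirit to a global cache.
  const cached = new http.Agent({ keepAlive: true });
  const reused = await countConnections(5, () => cached);
  // A new agent per request, so no socket is ever reused.
  const fresh = await countConnections(5, () => new http.Agent({ keepAlive: false }));
  return { reused, fresh };
}

demo().then((result) => console.log(result));
```

The shared agent should need a single connection for all five requests, while the per-request agents should open one connection each. With a rotating proxy in place of the local server, each new connection is the provider's opportunity to assign a new exit IP, and the shared agent never creates that opportunity.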

To Reproduce:

Set up a CheerioCrawler run that uses a rotating proxy service and crawl a few hundred https://httpbin.org/ip?random_string URLs. The exit IP will remain static.
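A minimal sketch of two helpers for such a run, assuming Node's built-in crypto module: one generates the unique URLs and the other tallies the exit IPs from the response bodies. `buildIpCheckUrls` and `summariseExitIps` are hypothetical names, and the crawler wiring itself (queueing the URLs and configuring the proxy) is deliberately left out.

```javascript
const crypto = require('crypto');

// Build `count` URLs of the form https://httpbin.org/ip?<random_string>.
// The random query string keeps each URL unique, as in the report above.
function buildIpCheckUrls(count) {
  return Array.from(
    { length: count },
    () => `https://httpbin.org/ip?${crypto.randomBytes(8).toString('hex')}`,
  );
}

// httpbin's /ip endpoint responds with JSON such as {"origin": "203.0.113.7"}.
// Given the raw response bodies, count how often each exit IP was seen.
function summariseExitIps(bodies) {
  const counts = new Map();
  for (const body of bodies) {
    const { origin } = JSON.parse(body);
    counts.set(origin, (counts.get(origin) || 0) + 1);
  }
  return counts;
}
```

If rotation works, the map returned by `summariseExitIps` should hold many distinct keys. A single key covering every response corresponds to the static exit IP described here.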

If you comment out the agent caching logic in https://github.com/apify/got-scraping/blob/master/src/hooks/proxy.ts and re-run, you will get the Apify 0.2.* behaviour (a different exit IP per request).

Expected behavior:

Same as Apify 0.2.* (proxy connections are not globally cached and reused). Failing that, a migration path that lets you disable the agent caching done in got-scraping via an option to requestAsBrowser().

This behaviour would also match that of PuppeteerCrawler, where incognito tabs result in a new proxy connection (and thus a new exit IP) per tab when used with a rotating proxy service.
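To make the requested opt-out concrete, here is one possible shape for it. This is purely illustrative: `agentFor` and its `reuseAgent` option are hypothetical and are not part of got-scraping or the Apify SDK, and a plain http.Agent stands in for a real proxy agent.

```javascript
const http = require('http');

// Agents cached per proxy URL (illustrative only).
const agentCache = new Map();

// Hypothetical helper: returns the cached agent for `proxyUrl`, or a brand-new
// non-keep-alive agent when the caller opts out with `reuseAgent: false`.
// A plain http.Agent is used here as a stand-in for a real proxy agent.
function agentFor(proxyUrl, { reuseAgent = true } = {}) {
  if (!reuseAgent) {
    return new http.Agent({ keepAlive: false });
  }
  let agent = agentCache.get(proxyUrl);
  if (!agent) {
    agent = new http.Agent({ keepAlive: true });
    agentCache.set(proxyUrl, agent);
  }
  return agent;
}
```

Keeping reuse as the default preserves connection pooling for callers who pin a session, while callers who rely on per-connection rotation can opt out one request at a time.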

System information:

  • OS: Ubuntu 20
  • Node.js version: 16.x
  • Apify SDK version: 2.1

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
szmarczak commented, Oct 15, 2021

Released got-scraping@3.2.6

1 reaction
mnmkng commented, Oct 12, 2021

cc @szmarczak

Apify Proxy randomly rotates upstream if you don’t provide a session, so this should be testable with it.
