CheerioCrawler reuses proxy connections, breaking upstream IP rotation
Describe the bug
In Apify v0.2.*, every proxy request made by requestAsBrowser() used a distinct connection (https://github.com/apify/http-request/blob/d4a12a856acebfebbf22a5cf57593c81a4be6462/src/index.js#L136). With Apify v2.* and the move to got-scraping (and possibly v1.*, though we skipped that version), proxy agent connections are now globally cached. This causes multiple proxied requests to the same origin to travel down the same encrypted CONNECT tunnel, breaking upstream IP rotation done at the proxy layer (as is common with most datacentre offerings from providers like Bright Data or Proxy Rack). Essentially, requests that should exit the upstream proxy provider with different IPs all end up exiting with the same IP, which makes them easy to ban.
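To make the difference concrete, here is a minimal sketch of the two strategies. It is a simplified model, not got-scraping's actual code: it uses plain Node.js `http.Agent` instances (which do not tunnel through a proxy themselves), and the function names and the cache keyed by proxy URL are assumptions made for illustration.

```javascript
const http = require('http');

// Simplified model of a global agent cache keyed by proxy URL (assumed shape).
const agentCache = new Map();

// Cached strategy: every request for the same proxy URL gets the same agent,
// so keep-alive sockets (and any tunnel built on them) are shared.
function getCachedAgent(proxyUrl) {
    if (!agentCache.has(proxyUrl)) {
        agentCache.set(proxyUrl, new http.Agent({ keepAlive: true }));
    }
    return agentCache.get(proxyUrl);
}

// Per-request strategy: a new agent each time, so no socket is ever shared
// between requests and each one opens its own connection to the proxy.
function createFreshAgent() {
    return new http.Agent({ keepAlive: false });
}
```

With the cached strategy, a rotating proxy only sees one long-lived connection per origin, so it has no opportunity to assign a new exit IP.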
To Reproduce
Set up a CheerioCrawler run to use a rotating proxy service and crawl a few hundred https://httpbin.org/ip?random_string URLs. The exit IP will remain static.
If you comment out the agent caching logic in https://github.com/apify/got-scraping/blob/master/src/hooks/proxy.ts and re-run, you will get the Apify 0.2.* behaviour (a different exit IP per request).
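The reproduction could be scripted roughly as follows. This is a sketch, assuming Apify SDK v2 is installed and that `proxyUrl` points at a rotating proxy endpoint you supply. The helper names (`buildUrls`, `countExitIps`, `runRepro`) are hypothetical, and `additionalMimeTypes` is set because httpbin.org/ip responds with JSON rather than HTML.

```javascript
// Build `count` unique httpbin URLs; the query string only defeats deduplication.
function buildUrls(count) {
    return Array.from({ length: count }, (_, i) => {
        const suffix = Math.random().toString(36).slice(2);
        return `https://httpbin.org/ip?${i}-${suffix}`;
    });
}

// Tally how many responses came back from each exit IP.
function countExitIps(origins) {
    const counts = new Map();
    for (const origin of origins) {
        counts.set(origin, (counts.get(origin) || 0) + 1);
    }
    return counts;
}

// Crawl the URLs through the given proxy and return the exit IP tally.
async function runRepro(proxyUrl, urlCount = 200) {
    const Apify = require('apify'); // loaded lazily; requires the apify package
    const origins = [];
    const requestList = await Apify.openRequestList(null, buildUrls(urlCount));
    const proxyConfiguration = await Apify.createProxyConfiguration({
        proxyUrls: [proxyUrl],
    });
    const crawler = new Apify.CheerioCrawler({
        requestList,
        proxyConfiguration,
        additionalMimeTypes: ['application/json'],
        // httpbin returns { "origin": "<exit IP>" }
        handlePageFunction: async ({ json }) => {
            origins.push(json.origin);
        },
    });
    await crawler.run();
    return countExitIps(origins);
}
```

Calling `runRepro('http://user:pass@proxy.example:8000')` (a placeholder URL) and inspecting the returned map shows whether the crawl exited through one IP or many.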
Expected behavior
Same as Apify 0.2.* (proxy connections are not globally cached and reused). Failing that, a migration path that lets you disable the agent caching done in got-scraping via an option to requestAsBrowser().
This behaviour would also match that of PuppeteerCrawler (incognito tabs result in a new proxy connection, and thus a new exit IP, per tab when used in conjunction with a rotating proxy service).
System information:
- OS: Ubuntu 20
- Node.js version: 16.x
- Apify SDK version: 2.1
Released
got-scraping@3.2.6
cc @szmarczak
Apify Proxy randomly rotates upstream if you don’t provide a session, so this should be testable with it.