CheerioCrawler reuses proxy connections, breaking upstream IP rotation

Describe the bug: In Apify v0.2.*, every proxy request made by requestAsBrowser() used a distinct connection (https://github.com/apify/http-request/blob/d4a12a856acebfebbf22a5cf57593c81a4be6462/src/index.js#L136). With Apify v2.* and the move to got-scraping (and possibly v1.*, though we skipped that version), proxy agent connections are now globally cached. As a result, multiple proxied requests to the same origin travel down the same encrypted CONNECT tunnel, which breaks upstream IP rotation done at the proxy layer (common with datacentre offerings from providers such as Bright Data or Proxy Rack). Requests that should exit the upstream proxy provider with different IPs all exit with the same IP, which makes them easy to ban.
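The effect of a cached agent can be seen without any proxy at all. The sketch below is not got-scraping's code: it uses a plain Node `http.Agent` and a local server as stand-ins (the helper names `get` and `compareAgentStrategies` are ours), and counts how many TCP connections the server sees when requests share one keep-alive agent versus when each request gets a fresh agent.

```javascript
const http = require('http');

// Send one GET through the given agent and resolve once the body is fully read.
function get(port, agent) {
  return new Promise((resolve, reject) => {
    http.get({ host: '127.0.0.1', port, agent }, (res) => {
      res.resume(); // discard the body; only the connection matters here
      res.on('end', resolve);
    }).on('error', reject);
  });
}

async function compareAgentStrategies(requestCount = 5) {
  // Local stand-in for the proxy: counts every new TCP connection it accepts.
  let connections = 0;
  const server = http.createServer((req, res) => res.end('ok'));
  server.on('connection', () => { connections += 1; });
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  const { port } = server.address();

  // Strategy A: one cached keep-alive agent shared by every request.
  const cachedAgent = new http.Agent({ keepAlive: true, maxSockets: 1 });
  for (let i = 0; i < requestCount; i += 1) await get(port, cachedAgent);
  const cachedConnections = connections;
  cachedAgent.destroy();

  // Strategy B: a fresh, non-keep-alive agent for every request.
  for (let i = 0; i < requestCount; i += 1) {
    const agent = new http.Agent({ keepAlive: false });
    await get(port, agent);
    agent.destroy();
  }
  const freshConnections = connections - cachedConnections;

  await new Promise((resolve) => server.close(resolve));
  return { cachedConnections, freshConnections };
}
```

Calling `compareAgentStrategies(5).then(console.log)` should show the shared agent opening a single connection and the per-request agents opening one each. With a rotating proxy, the exit IP is typically chosen per connection, so a single connection means a single exit IP.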

To Reproduce: Set up a CheerioCrawler run that uses a rotating proxy service and crawl a few hundred https://httpbin.org/ip?random_string URLs. The exit IP will remain static.

If you comment out the agent caching logic in https://github.com/apify/got-scraping/blob/master/src/hooks/proxy.ts and re-run, you will get the Apify 0.2.* behaviour (a different exit IP per request).
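For the reproduction above, two small helpers may be handy. This is only a sketch: `buildProbeUrls` and `countExitIps` are hypothetical names, and it assumes httpbin's `/ip` endpoint returns a JSON body with an `origin` field. The bodies would be collected from the crawler's page handler, which is left out here.

```javascript
const crypto = require('crypto');

// Build `count` unique probe URLs so the crawler does not deduplicate them.
function buildProbeUrls(count) {
  return Array.from(
    { length: count },
    () => `https://httpbin.org/ip?${crypto.randomBytes(8).toString('hex')}`,
  );
}

// Tally how many responses came back for each exit IP.
// Each body is expected to look like {"origin": "203.0.113.7"}.
function countExitIps(responseBodies) {
  const tally = new Map();
  for (const body of responseBodies) {
    const { origin } = JSON.parse(body);
    tally.set(origin, (tally.get(origin) || 0) + 1);
  }
  return tally;
}
```

If rotation works, the tally should hold many distinct IPs. With the cached agent described in this issue, it collapses to a single entry.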

Expected behavior: Same as Apify 0.2.* (proxy connections are not globally cached and re-used). Failing that, a migration path that lets you disable the agent caching done in got-scraping via an option to requestAsBrowser().

This behaviour would also match that of PuppeteerCrawler, where incognito tabs result in a new proxy connection, and thus a new exit IP, per tab when used in conjunction with a rotating proxy service.
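To make the requested opt-out concrete, here is a rough sketch of what such a switch could look like. It is purely illustrative: `getProxyAgent` and the `reuseConnections` option are hypothetical names, not part of got-scraping or Apify, and a plain `http.Agent` stands in for a real proxy tunnel agent.

```javascript
const http = require('http');

// Global cache, keyed by proxy URL (mirrors the caching behaviour reported above).
const agentCache = new Map();

// Hypothetical factory: returns a cached agent by default, or a fresh
// non-keep-alive agent when the caller opts out of connection reuse.
function getProxyAgent(proxyUrl, { reuseConnections = true } = {}) {
  if (!reuseConnections) {
    // New agent, new connection: lets the upstream proxy rotate the exit IP.
    return new http.Agent({ keepAlive: false });
  }
  let agent = agentCache.get(proxyUrl);
  if (!agent) {
    agent = new http.Agent({ keepAlive: true });
    agentCache.set(proxyUrl, agent);
  }
  return agent;
}
```

With the default, two calls for the same proxy URL return the same agent. With `{ reuseConnections: false }`, every call returns a new one, which is the Apify 0.2.* behaviour we are asking for.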

System information: