Autoscaled Pool doesn't properly downscale on overloaded client
Describe the bug
The crawler autoscaling has a feature that should downscale concurrency when the Apify client is overloaded (returning 429 statuses). As you can see from the test run, even under heavy overload it just keeps scaling up. I think the parameters for downscaling are probably just too high.
To Reproduce
Run: https://console.apify.com/actors/BgCUCMvOzbBXp4uPG/runs/vpyiYdTegjhzVlfHv
Actor code:
```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // Enqueue 2000 requests that will all fail, so every one of them gets retried.
    for (let i = 0; i < 2000; i++) {
        await requestQueue.addRequest({ url: `https://example.com/${i}` });
    }
    const crawler = new Apify.BasicCrawler({
        requestQueue,
        maxConcurrency: 1,
        maxRequestRetries: 99999,
        autoscaledPoolOptions: {
            loggingIntervalSecs: 0.1,
        },
        handleRequestFunction: async (context) => {
            await Apify.utils.sleep(50);
            // Fail every request so the request queue client is kept busy with retries.
            throw 'Crash';
        },
    });
    // Force the concurrency ceiling up every 5 seconds so the pool has room to scale
    // and the rate-limited client gets hammered harder and harder.
    setInterval(() => {
        crawler.autoscaledPool.maxConcurrency += 1;
    }, 5000);
    // Silence the exception log spam from the crashing handler.
    crawler.log.exception = () => {};
    await crawler.run();
});
```
Expected behavior
The crawler starts downscaling, or at least stops increasing concurrency, at some point. When it consistently hits 429s, downscaling should kick in.
System information:
{"apifyVersion":"2.2.2","apifyClientVersion":"2.2.0","osType":"Linux","nodeVersion":"v16.14.0"}
Additional context
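For reference, the knobs that appear to govern this detection are the autoscaled pool's snapshotter and system-status options. The option names below (`snapshotterOptions.maxClientErrors`, `clientSnapshotIntervalSecs`, `systemStatusOptions.maxClientOverloadedRatio`) are taken from my reading of the SDK 2.x docs and may not match this exact version, so treat it as a sketch of where one would try to tighten the detection, not a verified workaround:

```js
// Sketch only: option names assumed from the Apify SDK 2.x docs, not verified
// against this exact version. The idea is to make the client-overload check
// more sensitive instead of relying on the defaults.
const crawler = new Apify.BasicCrawler({
    requestQueue, // same queue as in the repro above
    autoscaledPoolOptions: {
        snapshotterOptions: {
            clientSnapshotIntervalSecs: 1, // sample the client (429) state once per second
            maxClientErrors: 1,            // mark a snapshot as overloaded on a smaller error delta
        },
        systemStatusOptions: {
            maxClientOverloadedRatio: 0.2, // fewer overloaded snapshots needed to block upscaling
        },
    },
    handleRequestFunction: async (context) => {
        // ... actual scraping logic ...
    },
});
```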
Created 2 years ago · Comments: 28 (28 by maintainers)
So after fixing the issue with multiple client instances, we have another, deeper problem. Currently we wait 3 seconds before we consider retrying a failed request. This results in trying new requests before we retry the failed ones. We keep fetching more and more from the queue, eventually hitting the 1000-request limit that prints the warning, because we basically never retry anything before we go through the whole queue head - with higher concurrency that happens faster than those 3 seconds.
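A quick back-of-envelope check with the repro's numbers (50 ms handler, failed requests postponed roughly 3 s) shows how fast fresh requests outrun the retries; the snippet below is just for illustration:

```js
// Back-of-envelope: with the repro's ~50 ms handler and a ~3 s delay before a
// failed request becomes eligible again, how many *new* requests does the
// crawler start before it ever retries the first failure?
const handlerMs = 50;        // sleep in handleRequestFunction
const retryDelayMs = 3000;   // postponement before a failed request is retried
const queueHeadLimit = 1000; // size at which the queue-head warning fires

for (const concurrency of [1, 10, 20, 50]) {
    const newRequestsBeforeFirstRetry = Math.floor(retryDelayMs / handlerMs) * concurrency;
    console.log(
        `concurrency ${concurrency}: ~${newRequestsBeforeFirstRetry} fresh requests started first`,
        newRequestsBeforeFirstRetry >= queueHeadLimit ? '(exceeds the 1000 queue-head limit)' : ''
    );
}
```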
Because of this, it takes thousands of rate-limited requests before any downscaling happens - the heuristic is probably just not designed for high concurrency. All it does is compare two snapshots and check the difference in the number of rate-limited requests. On top of that, we only count errors from second retries, which makes it much worse. Thousands of failed requests are made before we downscale even once, because the error delta is usually less than 3 (the currently configured limit).
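To make that concrete, here is a minimal sketch of the heuristic as described above; the names are made up for the example and are not the SDK's internals:

```js
// Illustrative sketch of the current downscaling heuristic: only 429s observed
// at the *second* retry are counted, and the client is considered overloaded
// only when that count grows by at least the configured limit (3) between
// two consecutive snapshots.
const MAX_CLIENT_ERRORS = 3;

function isClientOverloaded(previousSnapshot, currentSnapshot) {
    const delta = currentSnapshot.secondRetryRateLimitErrors
        - previousSnapshot.secondRetryRateLimitErrors;
    // Whether the real comparison is > or >= does not change the picture:
    // thousands of 429s at earlier attempts never move this number.
    return delta >= MAX_CLIENT_ERRORS;
}
```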
The SDK keeps starting more and more new requests (because of that 3 s timeout we don't retry the failed ones), the client then retries all of them once we start hitting the rate-limit threshold, and until it gets to the second retries (the only ones counted for the overload metric), it is flooded with new requests - it does thousands of "first tries", tens of first retries and only very few second retries. Concurrency keeps growing, because the error-rate delta is mostly 0 or 1, so we upscale many times and downscale almost never. As an example, this is the rate-limit error histogram after a few minutes:
[ 3970, 7, 3, 1 ]
I am not 100% sure how to interpret such a high number in the first position, given that we only have 2k requests in the queue in the repro. I guess it is caused by retries at the SDK level after those 3 seconds pass. I would say we should not look only at second retries; we should care about the overall error rate of the client. The current metric is too weak for high concurrency, especially because of the 3 s timeout we wait before retrying failed requests.
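A rough sketch of what "overall error rate of the client" could look like, again with illustrative names only: sum the whole rate-limit histogram (429s at any attempt) and relate its growth to the number of requests issued in the same window:

```js
// Rough sketch of the proposed metric (illustrative only): instead of the
// second-retry delta, look at how many of the requests made since the last
// snapshot were rate limited at *any* attempt.
function overallRateLimitRatio(previousSnapshot, currentSnapshot) {
    const sum = (histogram) => histogram.reduce((acc, n) => acc + n, 0);
    // rateLimitHistogram[i] = number of 429s observed at the i-th attempt,
    // e.g. [ 3970, 7, 3, 1 ] from the run above.
    const errorsDelta = sum(currentSnapshot.rateLimitHistogram) - sum(previousSnapshot.rateLimitHistogram);
    const requestsDelta = currentSnapshot.totalRequests - previousSnapshot.totalRequests;
    return requestsDelta > 0 ? errorsDelta / requestsDelta : 0;
}
// The pool would then count as overloaded whenever this ratio stays above some
// threshold, instead of waiting for second retries to pile up.
```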
Then there is (wait for it) another issue - in the reproduction we have snapshotting set to 0.1 s, so we take snapshots 10 times a second. But since we only look at the error-rate delta between consecutive snapshots, and a counted error shows up maybe once a second, taking snapshots too often means we miss the deltas and again don't downscale (but rather upscale, lol).
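Putting numbers on that: with roughly one counted error per second and a delta threshold of 3, frequent snapshots are almost guaranteed to look healthy:

```js
// Why frequent snapshots hide the overload: with roughly one counted
// rate-limit error per second, a 0.1 s snapshot interval means almost every
// consecutive snapshot pair sees a delta of 0.
const errorsPerSecond = 1;      // counted (second-retry) 429s, per the comment above
const thresholdPerSnapshot = 3; // delta needed to flag the client as overloaded

for (const intervalSecs of [0.1, 1, 5]) {
    const expectedDelta = errorsPerSecond * intervalSecs;
    console.log(
        `snapshot every ${intervalSecs}s -> expected delta ~${expectedDelta}`,
        expectedDelta >= thresholdPerSnapshot ? '(would downscale)' : '(looks healthy, may even upscale)'
    );
}
```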
what a rabbit hole
@mnmkng Ah, that’s true, my bad