
Autoscaled Pool doesn't properly downscale on overloaded client

See original GitHub issue

Describe the bug
The crawler's autoscaling has a feature that should downscale concurrency when the Apify client is overloaded (returning 429 statuses). As you can see from the test run, even under heavy overload it continues to scale up. I think the thresholds for downscaling are probably just too high.

To Reproduce
Run: https://console.apify.com/actors/BgCUCMvOzbBXp4uPG/runs/vpyiYdTegjhzVlfHv

Actor code:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // Enqueue enough requests to keep the crawler busy for a long time.
    for (let i = 0; i < 2000; i++) {
        await requestQueue.addRequest({ url: `https://example.com/${i}` });
    }

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        maxConcurrency: 1,
        maxRequestRetries: 99999,
        autoscaledPoolOptions: {
            loggingIntervalSecs: 0.1,
        },
        handleRequestFunction: async (context) => {
            // Every request fails on purpose, so it gets reclaimed and retried.
            await Apify.utils.sleep(50);
            throw new Error('Crash');
        },
    });

    // Keep raising the concurrency ceiling so the autoscaled pool can scale up freely.
    setInterval(() => {
        crawler.autoscaledPool.maxConcurrency += 1;
    }, 5000);

    // Silence the exception logging from the intentionally failing requests.
    crawler.log.exception = () => {};

    await crawler.run();
});

Expected behavior
The crawler should start downscaling, or at least hold concurrency steady, at some point. Once it consistently hits 429 responses, downscaling should kick in.

System information: {"apifyVersion":"2.2.2","apifyClientVersion":"2.2.0","osType":"Linux","nodeVersion":"v16.14.0"}


Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 28 (28 by maintainers)

Top GitHub Comments

1 reaction
B4nan commented, Mar 9, 2022

So after fixing the issue with multiple client instances, we have another, deeper problem. Currently we wait for 3 seconds before we consider retrying a failed request. This results in trying out new requests before we retry the failed ones. We keep fetching more and more from the queue, reaching the 1000 limit which prints the warning, because we basically never retry anything before we go through the whole queue - with higher concurrency that happens faster than those 3 seconds.
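An illustrative sketch of the fetching behaviour described above (the names `reclaimedAt`, `RETRY_BACKOFF_MILLIS` and `fetchNextRequest` here are made up for illustration, not the SDK's actual internals): a failed request only becomes eligible again after the back-off window, so under high concurrency fresh requests from the queue head always win.

// Illustrative sketch only - not the real RequestQueue implementation.
const RETRY_BACKOFF_MILLIS = 3000;  // the 3s delay mentioned above
const reclaimedAt = new Map();      // requestId -> timestamp of the last failure

function isEligibleForRetry(requestId, now = Date.now()) {
    const failedAt = reclaimedAt.get(requestId);
    return failedAt === undefined || now - failedAt >= RETRY_BACKOFF_MILLIS;
}

// The crawler keeps asking for the next request; failed ones still inside the
// back-off window are skipped, so new requests are tried before any retry happens.
function fetchNextRequest(queueHead) {
    return queueHead.find((req) => isEligibleForRetry(req.id)) ?? null;
}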

Due to this, it takes thousands of rate-limited requests before downscaling happens, as the heuristic is probably not designed for high concurrency. All it does is compare two snapshots and check the difference in rate-limited requests. We only consider second retries, which makes it much worse. Thousands of failed requests are made before we downscale once, as the error delta is usually less than 3 (the currently set limit). A simplified sketch of this check follows below.
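A minimal sketch of the heuristic as described above (the function and field names are illustrative, not the SDK's actual API): only the delta of second-retry rate-limit errors between two consecutive snapshots is compared against the limit.

// Illustrative sketch of the client-overload check described above.
const CLIENT_ERROR_LIMIT = 3; // the "currently set limit" mentioned above

function isClientOverloaded(previousSnapshot, currentSnapshot) {
    // Only rate-limit errors seen on the second retry feed this delta.
    const delta = currentSnapshot.secondRetryRateLimitErrors
        - previousSnapshot.secondRetryRateLimitErrors;
    return delta >= CLIENT_ERROR_LIMIT;
}

// With thousands of first tries, tens of first retries and only a handful of
// second retries, this delta is almost always 0 or 1, so the pool keeps upscaling.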

The SDK tries more and more requests (because of that 3s timeout we don't retry them), then the client retries all of them once we start hitting the rate limit threshold, and until it starts doing the second retries (which are the only ones considered by the overloading metric), it is flooded with new requests - it does thousands of "first tries", tens of first retries and only very few second retries. Concurrency keeps growing, as the error rate delta is mostly 0 or 1, so we upscale many times and downscale almost never. As an example, this is the rate limit error histogram after a few minutes: [ 3970, 7, 3, 1 ]. I am not 100% sure how to interpret such a high number in the first position, given we only have 2k requests in the queue as in the repro. I guess it is caused by retries on the SDK level after those 3 seconds pass.
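One plausible reading of that histogram, sketched below (illustrative only, not the apify-client internals): index i counts API calls that received a 429 on their i-th attempt, so [ 3970, 7, 3, 1 ] would mean 3970 calls were rate limited on the first attempt, 7 on the first retry, 3 on the second and 1 on the third.

// Illustrative sketch of how such a histogram accumulates.
// rateLimitErrors[i] = number of API calls that got a 429 on attempt i
// (index 0 = first attempt, 1 = first retry, 2 = second retry, ...).
const rateLimitErrors = [];

function recordRateLimitError(attemptIndex) {
    rateLimitErrors[attemptIndex] = (rateLimitErrors[attemptIndex] || 0) + 1;
}

// After a few minutes of the reproduction this ends up looking like
// [ 3970, 7, 3, 1 ] - yet only index 2 (second retries) feeds the overload
// heuristic, so the huge first-attempt count never triggers downscaling.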

I would say we should not consider only second retries; we should care about the overall error rate of the client. The current metric is too weak for high concurrency, especially due to the 3s timeout we wait before retrying failed requests.
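A hedged sketch of the direction proposed above (field names like `totalApiCalls` are assumptions for illustration, not an actual implementation): compare the rate-limited calls across all retry stages against the total number of API calls made between two snapshots.

// Sketch of the proposed metric: overall client error rate instead of
// the second-retry delta. All names here are illustrative.
function errorRateBetween(previousSnapshot, currentSnapshot) {
    const sum = (arr) => arr.reduce((a, b) => a + b, 0);
    const calls = currentSnapshot.totalApiCalls - previousSnapshot.totalApiCalls;
    const errors = sum(currentSnapshot.rateLimitErrors)
        - sum(previousSnapshot.rateLimitErrors);
    return calls > 0 ? errors / calls : 0;
}

function isClientOverloadedByRate(previousSnapshot, currentSnapshot, threshold = 0.1) {
    // Overloaded when more than, say, 10% of recent API calls were rate limited.
    return errorRateBetween(previousSnapshot, currentSnapshot) > threshold;
}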

Then there is (wait for it) another issue - in the reproduction we have snapshotting set to 0.1s, so we take snapshots 10 times a second. Since we check only the error rate delta between consecutive snapshots, and errors show up maybe once a second, snapshotting too often makes us miss the deltas, so again we don't downscale (but rather upscale, lol). A small numeric illustration of this follows below.
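A quick back-of-the-envelope illustration of that last point (the numbers are made up): if second-retry errors arrive roughly once per second but cumulative counts are sampled every 0.1 s, nearly every consecutive delta is 0, so a per-snapshot "delta >= 3" check essentially never fires.

// Illustrative only: errors arriving ~1/s, snapshots every 0.1s.
// Cumulative second-retry error counts sampled at 0.1s intervals:
const samples = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2];

// Deltas between consecutive snapshots are almost always 0.
const deltas = samples.slice(1).map((value, i) => value - samples[i]);
console.log(deltas); // mostly 0s with an occasional 1 - never reaches a limit of 3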

what a rabbit hole

0 reactions
metalwarrior665 commented, Mar 9, 2022

@mnmkng Ah, that’s true, my bad

Read more comments on GitHub.