Autoscaled Pool doesn't properly downscale on overloaded client
Describe the bug
The crawler autoscaling has a feature that should downscale concurrency when the Apify client is overloaded (returning 429 statuses). As you can see from the test run, even under heavy overload it just keeps scaling up. I think the parameters for downscaling are probably just too high.
To Reproduce
Run: https://console.apify.com/actors/BgCUCMvOzbBXp4uPG/runs/vpyiYdTegjhzVlfHv
Actor code:
```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // Enqueue 2000 requests that will all fail, so every one of them gets retried.
    for (let i = 0; i < 2000; i++) {
        await requestQueue.addRequest({ url: `https://example.com/${i}` });
    }
    const crawler = new Apify.BasicCrawler({
        requestQueue,
        maxConcurrency: 1,
        maxRequestRetries: 99999,
        autoscaledPoolOptions: {
            loggingIntervalSecs: 0.1,
        },
        handleRequestFunction: async (context) => {
            await Apify.utils.sleep(50);
            // Fail every request so the request queue client is kept busy with retries.
            throw 'Crash';
        },
    });
    // Force the concurrency ceiling up every 5 seconds so the pool has room to scale
    // and the rate-limited client gets hammered harder and harder.
    setInterval(() => {
        crawler.autoscaledPool.maxConcurrency += 1;
    }, 5000);
    // Silence the exception log spam from the crashing handler.
    crawler.log.exception = () => {};
    await crawler.run();
});
```
Expected behavior
The crawler starts downscaling, or at least stops increasing concurrency, at some point. When it consistently hits 429s, downscaling should kick in.
System information:
{"apifyVersion":"2.2.2","apifyClientVersion":"2.2.0","osType":"Linux","nodeVersion":"v16.14.0"}
Additional context
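For reference, the knobs that appear to govern this detection are the autoscaled pool's snapshotter and system-status options. The option names below (`snapshotterOptions.maxClientErrors`, `clientSnapshotIntervalSecs`, `systemStatusOptions.maxClientOverloadedRatio`) are taken from my reading of the SDK 2.x docs and may not match this exact version, so treat it as a sketch of where one would try to tighten the detection, not a verified workaround:

```js
// Sketch only: option names assumed from the Apify SDK 2.x docs, not verified
// against this exact version. The idea is to make the client-overload check
// more sensitive instead of relying on the defaults.
const crawler = new Apify.BasicCrawler({
    requestQueue, // same queue as in the repro above
    autoscaledPoolOptions: {
        snapshotterOptions: {
            clientSnapshotIntervalSecs: 1, // sample the client (429) state once per second
            maxClientErrors: 1,            // mark a snapshot as overloaded on a smaller error delta
        },
        systemStatusOptions: {
            maxClientOverloadedRatio: 0.2, // fewer overloaded snapshots needed to block upscaling
        },
    },
    handleRequestFunction: async (context) => {
        // ... actual scraping logic ...
    },
});
```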
Created 2 years ago · Comments: 28 (28 by maintainers)
So after fixing the issue with multiple client instances, we have another, deeper problem. Currently we wait 3 seconds before we consider retrying a failed request. This results in trying new requests before we retry the failed ones. We keep fetching more and more from the queue, eventually hitting the 1000-request limit that prints the warning, because we basically never retry anything before we go through the whole queue head - with higher concurrency that happens faster than those 3 seconds.
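A quick back-of-envelope check with the repro's numbers (50 ms handler, failed requests postponed roughly 3 s) shows how fast fresh requests outrun the retries; the snippet below is just for illustration:

```js
// Back-of-envelope: with the repro's ~50 ms handler and a ~3 s delay before a
// failed request becomes eligible again, how many *new* requests does the
// crawler start before it ever retries the first failure?
const handlerMs = 50;        // sleep in handleRequestFunction
const retryDelayMs = 3000;   // postponement before a failed request is retried
const queueHeadLimit = 1000; // size at which the queue-head warning fires

for (const concurrency of [1, 10, 20, 50]) {
    const newRequestsBeforeFirstRetry = Math.floor(retryDelayMs / handlerMs) * concurrency;
    console.log(
        `concurrency ${concurrency}: ~${newRequestsBeforeFirstRetry} fresh requests started first`,
        newRequestsBeforeFirstRetry >= queueHeadLimit ? '(exceeds the 1000 queue-head limit)' : ''
    );
}
```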
Because of this, it takes thousands of rate-limited requests before any downscaling happens - the heuristic is probably just not designed for high concurrency. All it does is compare two snapshots and check the difference in the number of rate-limited requests. On top of that, we only count errors from second retries, which makes it much worse. Thousands of failed requests are made before we downscale even once, because the error delta is usually less than 3 (the currently configured limit).
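To make that concrete, here is a minimal sketch of the heuristic as described above; the names are made up for the example and are not the SDK's internals:

```js
// Illustrative sketch of the current downscaling heuristic: only 429s observed
// at the *second* retry are counted, and the client is considered overloaded
// only when that count grows by at least the configured limit (3) between
// two consecutive snapshots.
const MAX_CLIENT_ERRORS = 3;

function isClientOverloaded(previousSnapshot, currentSnapshot) {
    const delta = currentSnapshot.secondRetryRateLimitErrors
        - previousSnapshot.secondRetryRateLimitErrors;
    // Whether the real comparison is > or >= does not change the picture:
    // thousands of 429s at earlier attempts never move this number.
    return delta >= MAX_CLIENT_ERRORS;
}
```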
The SDK keeps starting more and more new requests (because of that 3 s timeout we don't retry the failed ones), the client then retries all of them once we start hitting the rate-limit threshold, and until it gets to the second retries (the only ones counted for the overload metric), it is flooded with new requests - it does thousands of "first tries", tens of first retries and only very few second retries. Concurrency keeps growing, because the error-rate delta is mostly 0 or 1, so we upscale many times and downscale almost never. As an example, this is the rate-limit error histogram after a few minutes:
[ 3970, 7, 3, 1 ]
I am not 100% sure how to interpret such a high number in the first position, given that we only have 2k requests in the queue in the repro. I guess it is caused by retries at the SDK level after those 3 seconds pass. I would say we should not look only at second retries; we should care about the overall error rate of the client. The current metric is too weak for high concurrency, especially because of the 3 s timeout we wait before retrying failed requests.
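A rough sketch of what "overall error rate of the client" could look like, again with illustrative names only: sum the whole rate-limit histogram (429s at any attempt) and relate its growth to the number of requests issued in the same window:

```js
// Rough sketch of the proposed metric (illustrative only): instead of the
// second-retry delta, look at how many of the requests made since the last
// snapshot were rate limited at *any* attempt.
function overallRateLimitRatio(previousSnapshot, currentSnapshot) {
    const sum = (histogram) => histogram.reduce((acc, n) => acc + n, 0);
    // rateLimitHistogram[i] = number of 429s observed at the i-th attempt,
    // e.g. [ 3970, 7, 3, 1 ] from the run above.
    const errorsDelta = sum(currentSnapshot.rateLimitHistogram) - sum(previousSnapshot.rateLimitHistogram);
    const requestsDelta = currentSnapshot.totalRequests - previousSnapshot.totalRequests;
    return requestsDelta > 0 ? errorsDelta / requestsDelta : 0;
}
// The pool would then count as overloaded whenever this ratio stays above some
// threshold, instead of waiting for second retries to pile up.
```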
Then there is (wait for it) another issue - in the reproduction we have snapshotting set to 0.1 s, so we take snapshots 10 times a second. But since we only look at the error-rate delta between consecutive snapshots, and a counted error shows up maybe once a second, taking snapshots too often means we miss the deltas and again don't downscale (but rather upscale, lol).
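Putting numbers on that: with roughly one counted error per second and a delta threshold of 3, frequent snapshots are almost guaranteed to look healthy:

```js
// Why frequent snapshots hide the overload: with roughly one counted
// rate-limit error per second, a 0.1 s snapshot interval means almost every
// consecutive snapshot pair sees a delta of 0.
const errorsPerSecond = 1;      // counted (second-retry) 429s, per the comment above
const thresholdPerSnapshot = 3; // delta needed to flag the client as overloaded

for (const intervalSecs of [0.1, 1, 5]) {
    const expectedDelta = errorsPerSecond * intervalSecs;
    console.log(
        `snapshot every ${intervalSecs}s -> expected delta ~${expectedDelta}`,
        expectedDelta >= thresholdPerSnapshot ? '(would downscale)' : '(looks healthy, may even upscale)'
    );
}
```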
what a rabbit hole
@mnmkng Ah, that’s true, my bad