
`ReadManyItemsAsync` causes threadpool starvation on low-end App Service (B1)

See original GitHub issue

Describe the bug

Using ReadManyItemsAsync causes thread pool starvation in our application. This is observed by Kestrel emitting the log line below many times (which was also a red flag in another issue on this repo), by health checks going down, and by the executing endpoint never finishing.

As of "01/22/2023 11:14:20 +00:00", the heartbeat has been running for "00:00:03.4144034" which is longer than "00:00:01". This could be caused by thread pool starvation.

To Reproduce

I don’t have a reproduction scenario at hand, or in a public repo (this happens as part of a business application with closed source). However, the full snippet of the code in question is below. If more context is needed and access to our private source would help, I can provide that.

Some numbers:

  • The resulting budgetIdToEndUserId dictionary has 1734 items. The ActivePersonalBudgetSummary type is a “summary” document that only represents the data we need in this context; the full document is much bigger (we’re in the process of splitting these large documents up into smaller ones).
  • Size of personalBudgetCollection: 2014
  • Size of activePersonalBudgetCollection: 13186
  • Size of the full dump of these 2 collections: 4.9 GB (with dt.exe, and indented)

When we change the ReadManyItemsAsync() call to a loop of individual ReadItemAsync() calls (so no parallelism is involved), the endpoint takes ~1 min; a sketch of that variant follows the snippet below.

    private async Task<Dictionary<Guid, ActivePersonalBudgetSummary>> GetByPredicateAsync(Func<ActivePersonalBudgetSummary, bool> predicate)
    {
        // Resolve the two containers involved.
        var personalBudgetCollection = _cosmosClient.GetDatabase(_databaseId).GetContainer(Constants.CosmosDbContainerConstants.PersonalBudgetCollection.ContainerId);
        var activePersonalBudgetCollection = _cosmosClient.GetDatabase(_databaseId).GetContainer(Constants.CosmosDbContainerConstants.ActivePersonalBudgetCollection.ContainerId);

        // Find which budget is selected for each end user.
        var personalBudgetIterator = personalBudgetCollection.GetItemQueryIterator<EndUserActivePersonalBudget>(
            "SELECT c.EndUserId, c.ActivePersonalBudgetId FROM c WHERE c.IsSelected");

        var selectedBudgetIds = await personalBudgetIterator.ToListAsync(_monitoringEvents);

        // Map each selected budget id to the end user it belongs to.
        var budgetIdToEndUserId = selectedBudgetIds
            .Where(x => x.ActivePersonalBudgetId is not null)
            .ToDictionary(
                x => x.ActivePersonalBudgetId!,
                x => x.EndUserId);

        var results = new Dictionary<Guid, ActivePersonalBudgetSummary>();

        // Read all selected budget summaries in a single ReadManyItemsAsync call;
        // the item id doubles as the partition key.
        var response = await activePersonalBudgetCollection.ReadManyItemsAsync<ActivePersonalBudgetSummaryDocument>(
            budgetIdToEndUserId.Keys.Select(id => (id, new PartitionKey(id))).ToImmutableList());
        _monitoringEvents?.CosmosDbDiagnosticsReceived(response.Diagnostics);

        // Apply the predicate and key the results by end user.
        foreach (var item in response)
        {
            if (budgetIdToEndUserId.TryGetValue(item.Id, out var endUserId))
            {
                var summary = ToDomain(item);
                if (predicate(summary))
                    results[endUserId] = summary;
            }
        }

        return results;
    }
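For comparison, the sequential variant mentioned above looked roughly like the sketch below (a hypothetical reconstruction, not our exact production code): one awaited ReadItemAsync point read per id, reusing the same container and budgetIdToEndUserId dictionary, so no fan-out parallelism is involved.

    // Hypothetical sketch of the "one ReadItemAsync per id" variant: each point read is
    // awaited before the next one starts, so only one request is in flight at a time.
    var results = new Dictionary<Guid, ActivePersonalBudgetSummary>();

    foreach (var (budgetId, endUserId) in budgetIdToEndUserId)
    {
        // Point read by id; as in the ReadManyItemsAsync call above, the id doubles
        // as the partition key.
        var itemResponse = await activePersonalBudgetCollection.ReadItemAsync<ActivePersonalBudgetSummaryDocument>(
            budgetId, new PartitionKey(budgetId));

        var summary = ToDomain(itemResponse.Resource);
        if (predicate(summary))
            results[endUserId] = summary;
    }

    return results;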

Expected behavior

No thread pool starvation happens, and the endpoint returns successfully.

Actual behavior

Connecting to the instance with az webapp create-remote-connection ... and running dotnet counters, as suggested in the “how to detect thread pool starvation” guidance, supports the idea that this is caused by thread pool starvation. The app started out with a thread pool count of 6, and the highest I saw was 15 (see the screenshots below). The full video can be found here.

At the start of the request (screenshot: image-20230123-160440)

After 2:50 min (screenshot: image-20230123-160517)
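A minimal in-process alternative for spot-checking the same counters would be the sketch below. It is a hypothetical periodic logger, assuming .NET Core 3.0 or later (where ThreadPool.ThreadCount and ThreadPool.PendingWorkItemCount are available), not something taken from our codebase.

    // Hypothetical background sampler: logs the same thread pool signals that
    // dotnet-counters shows (thread count and queued work items).
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class ThreadPoolSampler
    {
        public static async Task RunAsync(CancellationToken cancellationToken)
        {
            while (!cancellationToken.IsCancellationRequested)
            {
                Console.WriteLine(
                    $"{DateTimeOffset.UtcNow:O} threads={ThreadPool.ThreadCount} " +
                    $"queued={ThreadPool.PendingWorkItemCount} " +
                    $"completed={ThreadPool.CompletedWorkItemCount}");

                // A steadily growing queue alongside a slowly climbing thread count is
                // the classic thread pool starvation signature.
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }
        }
    }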

Environment summary

  • SDK Version: 3.31.2
  • OS: Azure App Service (Linux)
  • .NET 7 (7.0.102)
  • Cosmos DB provisioning model: serverless

Additional context

We know that the hardware this app is running on is not production-level hardware from an App Service perspective. However, this hasn’t been an issue so far, so we haven’t had any need to scale up to more production-level provisioning. If this bug is related to the restricted environment, then I can fully understand that and would be fine closing the issue from that angle.

Issue Analytics

  • State: closed
  • Created: 8 months ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
NaluTripician commented, Feb 1, 2023

Hello, here is some information on the ReadMany API that will hopefully provide some insight into your problem.

First, I attempted to recreate your problem by performing a similar query on my end and looking at the threads used for queries of different sizes. Here are the results:

| ReadMany Query Size | Time        | Max Thread Count |
|---------------------|-------------|------------------|
| 500                 | 00:00:03.82 | 7                |
| 1000                | 00:00:04.21 | 9                |
| 1500                | 00:00:03.61 | 9                |
| 2000                | 00:00:04.12 | 10               |
| 2001                | 00:00:03.99 | 13               |
| 2001 ReadItem calls | 00:03:11.68 | 10               |

From these results you can see that the ReadMany API calls do not cause the max thread count to go up significantly as the number of items read increases, and that ReadMany executes much faster than issuing multiple ReadItem calls.

Now, if we look at how the ReadMany API is parallelized in the SDK, we can see a few things. First, here the SDK manages the maximum number of requests that can be granted concurrently, which is 10 * ProcessorCount (the test I performed was on a machine that has 12 cores). Ten concurrent requests per core is not that high and should not cause thread exhaustion, since both the number of concurrent requests and the number of threads depend on the number of cores the machine has.
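The pattern being described is a semaphore sized at 10 * Environment.ProcessorCount gating how many requests run at once. Below is a minimal method-body sketch of that pattern, not the SDK’s actual code; container, itemIds, and MyDocument are hypothetical stand-ins, and the ReadItemAsync call is just a placeholder for the per-item work.

    // Sketch of the concurrency cap described above (not the SDK's implementation).
    // At most 10 * ProcessorCount requests are allowed in flight at any time.
    var maxConcurrency = 10 * Environment.ProcessorCount;
    var gate = new SemaphoreSlim(maxConcurrency);

    var tasks = itemIds.Select(async id =>
    {
        await gate.WaitAsync();
        try
        {
            // Stand-in for the per-item / per-query work the SDK performs.
            return await container.ReadItemAsync<MyDocument>(id, new PartitionKey(id));
        }
        finally
        {
            gate.Release();
        }
    }).ToList();

    var responses = await Task.WhenAll(tasks);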

Next is how the queries are parallelized. Currently the SDK splits a ReadMany call into separate queries in two steps. First, the SDK groups the items by physical partition. Then, within each physical partition, it further splits the work into queries of no more than 1000 items each. Splitting by physical partition ensures that no query is a cross-partition query, which makes thread pool starvation even less likely. To see how the queries are split, I would recommend looking at this method in ReadManyQueryHelper, which generates the queries and the concurrent Tasks that are executed in parallel.
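In simplified form, the splitting described above boils down to “group, then chunk”. The sketch below is not the ReadManyQueryHelper source; items is a hypothetical (id, partition key) list like the one passed to ReadManyItemsAsync, grouping by PartitionKey stands in for the SDK’s grouping by physical partition, and Enumerable.Chunk assumes .NET 6+.

    // Simplified sketch of the splitting strategy described above (not the actual
    // ReadManyQueryHelper code).
    const int MaxItemsPerQuery = 1000;

    IEnumerable<(string Id, PartitionKey PartitionKey)[]> queryBatches = items
        // Stand-in for the SDK's grouping by physical partition / partition key range.
        .GroupBy(x => x.PartitionKey)
        // Within each partition, no single query asks for more than 1000 ids.
        .SelectMany(group => group.Chunk(MaxItemsPerQuery));

    // Each batch becomes one single-partition query; the batches are then executed
    // concurrently, subject to the 10 * ProcessorCount cap sketched earlier.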

Please let me know if you have any additional questions about how ReadMany works, and let me know when you have the stack trace available, so I can help determine whether ReadMany is the true culprit behind the thread pool starvation.

1 reaction
tiesmaster commented, Feb 9, 2023

@NaluTripician Thank you for the write-up; I totally agree with the analysis. This does not look like thread pool starvation after all. It’s interesting that Kestrel triggers that log message as a result of its internal heartbeat, but that must be the result of something else happening in the system.

@ealsur Again, we’ve solved the problem differently and moved on, and I don’t think there is anything left to do on this issue at this stage, so I’ll close it… Thank you and your team for your time and effort on this.


