[Bug] Large numbers of concurrent token refresh attempts cause a cache refresh convoy resulting in chronic 429 errors
Describe the bug
Consider an application that does the following:
- Uses WithAppTokenProvider with a callback configured to fetch tokens from a managed identity endpoint
- Issues hundreds of GetToken requests simultaneously
- Optionally increases MaxRetries in the HttpClient pipeline used to fetch tokens
When such an application encounters a 429 response from the managed identity (MI) endpoint, the result can be a storm of requests and retries that makes the 429 problem worse. In addition, given the current MSAL behavior for retries and cache access, every retry is guaranteed to miss the cache, so retries keep failing for as long as the MI endpoint does not return a successful token response.
Expected behavior
Retry attempts after the token cache is successfully refreshed should succeed via a cache hit rather than through a network request to the MI endpoint or authority. Only one request should be made to the endpoint to refresh the cache for any given cache entry and all other concurrent requests should consume that single result.
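The "one request per cache entry" behavior described above is often called single-flight or request coalescing. The following is a minimal, language-neutral sketch in Python and not MSAL's implementation or API: `SingleFlightTokenCache`, `fake_endpoint`, and `demo` are hypothetical names, the endpoint is simulated, and token expiry is deliberately left out.

```python
import asyncio


class SingleFlightTokenCache:
    """Hypothetical cache: at most one in-flight refresh per key."""

    def __init__(self, fetch):
        self._fetch = fetch    # async callable key -> token (stand-in for the MI endpoint call)
        self._tokens = {}      # key -> cached token (expiry handling omitted in this sketch)
        self._inflight = {}    # key -> asyncio.Task for the refresh currently in progress

    async def get_token(self, key):
        # Fast path: a previous refresh already populated the cache.
        if key in self._tokens:
            return self._tokens[key]
        # Join the refresh already in progress, or start one if none exists.
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._refresh(key))
            self._inflight[key] = task
        # shield() keeps one cancelled caller from cancelling the shared refresh.
        return await asyncio.shield(task)

    async def _refresh(self, key):
        try:
            token = await self._fetch(key)
            self._tokens[key] = token
            return token
        finally:
            # Success or failure, allow a later call to start a new refresh.
            self._inflight.pop(key, None)


async def demo(callers=200):
    calls = 0

    async def fake_endpoint(key):
        # Simulated endpoint that counts how often it is actually hit.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"token-for-{key}"

    cache = SingleFlightTokenCache(fake_endpoint)
    results = await asyncio.gather(
        *(cache.get_token("scope-a") for _ in range(callers))
    )
    return calls, results
```

In this sketch, running `asyncio.run(demo())` issues 200 concurrent requests for the same key while the simulated endpoint is called once; every caller receives the result of that single refresh. If the refresh fails, all waiting callers see the same exception and the next call starts a fresh attempt.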
Actual behavior
Once the initial request fails with a retriable status code, no subsequent token request attempts to read the cache; each one results in an additional network request.
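For contrast, a retry loop that re-reads the cache before every attempt would let a retry succeed as soon as any other caller has refreshed the entry. The sketch below is illustrative Python under stated assumptions, not MSAL or Azure SDK code: `RetriableError`, `get_token_with_retry`, and the dict-based cache are hypothetical.

```python
import time


class RetriableError(Exception):
    """Hypothetical stand-in for a 429 or other retriable response."""


def get_token_with_retry(cache, key, fetch, max_retries=3, base_delay=0.5):
    """Return a token for key, consulting the cache before each network attempt."""
    for attempt in range(max_retries + 1):
        # Re-check the cache on every attempt, not only on the first one:
        # another caller may have refreshed the entry while we were backing off.
        token = cache.get(key)
        if token is not None:
            return token
        try:
            token = fetch(key)  # simulated network call to the token endpoint
        except RetriableError:
            if attempt == max_retries:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
            continue
        cache[key] = token
        return token
```

With this shape, a request that received a 429 issues no further network call once the cache has been populated by someone else; it only returns to the endpoint while the cache is still empty.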
Reproduction Steps
Issue a large number of simultaneous GetToken requests with a ManagedIdentityCredential to induce a 429 response from the MI endpoint.
Environment
Customer example was in Service Fabric, but this should reproduce in any managed identity environment in which a 429 response is possible.
Top GitHub Comments
Updating just the Microsoft.Identity.Client package didn’t help.
@bgavrilMS - just to make sure - when you wrote “Can you try to upgrade to use MSAL 5.54 or higher?” - you meant 4.54, right?