Unusually high number of connections to Cosmos DB instance
Hi 👋
Describe the bug
Last night we experienced a very strange issue with our CosmosDB instance. It all started when an alert that monitors App Service memory % was triggered.
When I checked the logs, I saw entries like these:
{
"Timestamp": "2022-02-03T07:46:23.5098313+00:00",
"Level": "Error",
"MessageTemplate": "Redis connection error restored.",
"Properties": {
"SourceContext": "StackExchange.Redis.Extensions.Core.Implementations.RedisCacheConnectionPoolManager",
"MachineName": "..."
}
}
{"Timestamp":"2022-02-03T07:37:03.8894714+00:00","Level":"Error","MessageTemplate":"Could not store event in Redis","Exception":"StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation: SETEX events_belfasttrust_RI0nO_5GSp; UnableToConnect on xxx.redis.cache.windows.net:6380/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.2.4.27433, mc: 1/1/0, mgr: 10 of 10 available, clientName: PD1SDWK00003F, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=8,Free=32759,Min=1,Max=32767), v: 2.2.4.27433\r\n
---> StackExchange.Redis.RedisConnectionException: UnableToConnect on xxx.redis.cache.windows.net:6380/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.2.4.27433\r\n
at StackExchange.Redis.TaskExtensions.TimeoutAfter(Task task, Int32 timeoutMs) in /_/src/StackExchange.Redis/TaskExtensions.cs:line 55\r\n
at StackExchange.Redis.ConnectionMultiplexer.WaitAllIgnoreErrorsAsync(Task[] tasks, Int32 timeoutMilliseconds, LogProxy log, String caller, Int32 callerLineNumber) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 740\r\n
--- End of inner exception stack trace ---\r\n
at StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2760\r\n--- End of stack trace from previous location ---\r\n
at xxx.RedisDomainEventsCache.AddAsync(IExternalDomainEvent message) in C:\\agent\\_work\\2\\s\\src\\App\\Infrastructure\\Signalr\\RedisDomainEventsCache.cs:line 45","Properties":{"SourceContext":"xxx.Signalr.RedisDomainEventsCache","MachineName":"xx"}}
Nothing about CosmosDB yet.
That led me to check connection utilization on the App Service instance, and it was sky-high as well.
This is very unusual for us, because we don’t get much traffic at night.
In the morning, around 8 AM (UTC), I grabbed a memory dump from the App Service and opened it in dotMemory.
That prompted me to look at the application map in Application Insights, where I noticed we had made 2.2M calls to Cosmos over the last 12 hours. That is highly unusual: we only store tenant information there, and it is requested around 66k times per day.
This is what it looks like on a regular basis.
What is odd is that we only keep data in a single region, UK South.
This is how we construct our singleton CosmosClient instance:
services.AddSingleton(_ => new CosmosClient(
cs,
new CosmosClientOptions
{
ApplicationRegion = string.IsNullOrEmpty(region) ? Regions.UKSouth : region,
}));
This is how we use it:
public class AzureCosmosDbTenantStore : ITenantStore, IAsyncDisposable
{
private static Container? _container = null;
private static object _lock = new object();
private static ConcurrentDictionary<string, TenantDetails>? _tenants;
private readonly CosmosClient _cosmosClient;
private readonly string _database;
private readonly string _containerName;
private readonly SecretClient _secretClient;
private readonly IOptions<MultiTenantOptions> _multiTenantOptions;
private readonly ITenantDbSettings _tenantDbSettings;
private readonly ILogger<AzureCosmosDbTenantStore> _logger;
private ChangeFeedProcessor? _processor;
private bool _disposed;
public AzureCosmosDbTenantStore(
CosmosClient cosmosClient,
string database,
string containerName,
SecretClient secretClient,
IOptions<MultiTenantOptions> multiTenantOptions,
ITenantDbSettings tenantDbSettings,
ILogger<AzureCosmosDbTenantStore> logger)
{
_cosmosClient = cosmosClient;
_database = database;
_containerName = containerName;
_secretClient = secretClient;
_multiTenantOptions = multiTenantOptions;
_tenantDbSettings = tenantDbSettings;
_logger = logger;
}
public async Task<Tenant> GetTenantAsync(string identifier, CancellationToken cancellationToken = default)
{
var tenants = await GetAsync(cancellationToken);
var tenant = tenants.FirstOrDefault(x => x.Identifier == identifier) ??
throw new Exception($"Tenant with identifier {identifier} not found.");
if (_tenants != null && !_tenants[tenant.Identifier].SecretsLoaded)
{
await ReadTenantSecretsAsync(_tenants[tenant.Identifier]);
}
return tenant;
}
public async Task<Tenant[]> GetAsync(CancellationToken cancellationToken = default)
{
if (_tenants != null)
{
return _tenants.Select(x => x.Value.Tenant).ToArray();
}
var container = await GetContainer(cancellationToken);
var queryable = container.GetItemLinqQueryable<CosmosTenantInfo>();
var iterator = queryable.ToFeedIterator();
var read = await iterator.ReadNextAsync(cancellationToken);
_tenants = new ConcurrentDictionary<string, TenantDetails>(read.Resource
.Select(Tenant)
.Where(x => x != null)
.OfType<TenantDetails>()
.ToDictionary(x => x.Tenant.Identifier, x => x));
foreach (var (_, details) in _tenants)
{
if (!details.SecretsLoaded)
{
await ReadTenantSecretsAsync(details);
}
}
return _tenants.Select(x => x.Value.Tenant).ToArray();
}
public async ValueTask DisposeAsync()
{
await Dispose(true);
GC.SuppressFinalize(this);
}
private async Task<Tenant> ReadTenantSecretsAsync(TenantDetails tenantDetails)
{
var tenant = tenantDetails.Tenant;
try
{
var connectionString = await GetSecretValueAsync(tenant, _secretClient, "ConnectionString");
var jwtKey = await GetSecretValueAsync(tenant, _secretClient, "Jwt-Key");
tenant.Configuration
.WithConnectionString(_tenantDbSettings, connectionString, _multiTenantOptions.Value.DbPattern)
.WithAuthSettings(new AuthSettings(jwtKey));
..
tenantDetails.SecretsRead();
tenant.Initialized();
}
catch (Exception e)
{
_logger.LogError(e, "Could not read tenant configuration for tenant with id '{TenantId}'", tenant.Id);
}
return tenant;
}
private async Task<string> GetSecretValueAsync(Tenant tenant, SecretClient client, string keyName)
{
try
{
var response = await client.GetSecretAsync(Key(tenant, keyName));
return response.Value.Value;
}
catch (Exception e)
{
_logger.LogError(e, "Issue retrieving secret from key vault");
return string.Empty;
}
}
private string Key(Tenant tenant, string keyName)
=> $"{tenant.Identifier}-{keyName}".ToLower();
private TenantDetails? Tenant(CosmosTenantInfo x)
{
..
}
private async ValueTask Dispose(bool disposing)
{
if (_disposed)
{
return;
}
if (disposing)
{
if (_processor != null)
{
await _processor.StopAsync();
}
}
_disposed = true;
}
private async Task<Container> GetContainer(CancellationToken cancellationToken)
{
if (_container != null)
{
return _container;
}
var properties = new ContainerProperties(_containerName, "/partition");
var databaseResponse =
await _cosmosClient.CreateDatabaseIfNotExistsAsync(_database, cancellationToken: cancellationToken);
var database = databaseResponse.Database;
var containerResponse =
await database.CreateContainerIfNotExistsAsync(properties, cancellationToken: cancellationToken);
var leaseContainer =
await database.CreateContainerIfNotExistsAsync(new ContainerProperties($"{_containerName}_lease", "/id"), cancellationToken: cancellationToken);
lock (_lock)
{
var container = containerResponse.Container;
_container = container;
if (_processor == null)
{
var builder = _container.GetChangeFeedProcessorBuilder(
"watch",
(IReadOnlyCollection<CosmosTenantInfo> input, CancellationToken _) =>
{
foreach (var change in input)
{
var item = _tenants?.FirstOrDefault(x => x.Key == change.Id);
if (item == null || string.IsNullOrEmpty(item.Value.Key))
{
var tenantDetails = Tenant(change);
if (tenantDetails != null)
{
_tenants?.TryAdd(change.Id, tenantDetails);
}
}
else
{
if (_tenants != null)
{
_tenants.TryGetValue(change.Id, out var found);
if (found != null)
{
var tenantDetails = Tenant(change);
if (tenantDetails != null)
{
_tenants[change.Id] = tenantDetails;
}
}
}
}
}
return Task.CompletedTask;
});
_processor = builder
.WithInstanceName("changefeed")
.WithLeaseContainer(leaseContainer)
.Build();
}
}
if (_processor != null)
{
await _processor.StartAsync();
}
return _container;
}
public class TenantDetails
{
public TenantDetails(Tenant tenant)
{
Tenant = tenant;
SecretsLoaded = false;
}
public Tenant Tenant { get; }
public bool SecretsLoaded { get; private set; }
public TenantDetails SecretsRead()
{
SecretsLoaded = true;
return this;
}
}
public class CosmosTenantInfo
{
[JsonProperty(PropertyName = "id")]
public string Id { get; set; } = string.Empty;
[JsonProperty(PropertyName = "partition")]
public string Partition { get; set; } = "tenants";
public string Version { get; set; } = string.Empty;
public string Identifier { get; set; } = string.Empty;
public AmazonQuicksightInfo? AmazonQuicksightInfo { get; set; }
}
}
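One property of the store above that may be worth checking: GetAsync tests the static _tenants field and then populates it without any synchronization, so several requests that arrive before the first load finishes can each run the Cosmos query. Whether that played any part in the incident is not established here. As a minimal sketch, assuming a hypothetical helper named AsyncOnce<T> (not part of the original code or the SDK), the load could be serialized so that concurrent callers share one result:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: lets only one caller at a time run an async loader
// and caches the first successful result for everyone else.
public sealed class AsyncOnce<T> where T : class
{
    private readonly Func<CancellationToken, Task<T>> _load;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private T? _value;

    public AsyncOnce(Func<CancellationToken, Task<T>> load) => _load = load;

    public async Task<T> GetAsync(CancellationToken cancellationToken = default)
    {
        // Fast path: the value has already been loaded.
        var cached = Volatile.Read(ref _value);
        if (cached != null)
        {
            return cached;
        }

        await _gate.WaitAsync(cancellationToken);
        try
        {
            // Re-check after acquiring the gate: another caller may have finished the load.
            cached = Volatile.Read(ref _value);
            if (cached != null)
            {
                return cached;
            }

            // If the loader throws, nothing is cached and the next caller retries.
            var loaded = await _load(cancellationToken);
            Volatile.Write(ref _value, loaded);
            return loaded;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

In a store like the one above, the loader passed to AsyncOnce would be the part of GetAsync that queries the container and builds the dictionary. Failed loads are deliberately not cached, so a persistent failure would still produce one query per caller, just not in parallel.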
Unfortunately, there was nothing in the logs mentioning any issues with Cosmos.
I’m looking for help understanding why Application Insights showed such a large number of calls to Cosmos, as it may be related to our memory issue. Thanks for your help.
To Reproduce
Unfortunately, I can’t provide a reproduction: this happened for the first time, and after an App Service restart everything was back to normal.
Expected behavior
Actual behavior
Environment summary
SDK Version: 3.23.0
OS Version (e.g. Windows, Linux, MacOSX): Windows
Additional context
Issue Analytics
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
@j82w thanks for your comment.
Thanks for pointing this out 👍
Will do.
Yes, I always get the same hash code for _cosmosClient.GetHashCode(), and the implementation factory in services.AddSingleton(_ => new CosmosClient(..)) is called once. I don’t construct CosmosClient manually; it always goes via the dependency container.
Will grab one from production.
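For reference, a check of this kind (whether the singleton factory really runs once) can be sketched as follows. CountingFactory<T> is a hypothetical diagnostic wrapper, not something from the SDK or the original code; it only counts how often the wrapped factory delegate runs. Equal GetHashCode() values do not strictly prove that two references point at the same object, so ReferenceEquals is the more direct comparison when two resolved instances are at hand.

```csharp
using System;
using System.Threading;

// Hypothetical diagnostic wrapper: counts how many times a factory delegate runs.
public sealed class CountingFactory<T>
{
    private readonly Func<T> _inner;
    private int _invocations;

    public CountingFactory(Func<T> inner) => _inner = inner;

    // Number of times Create() has been called so far.
    public int Invocations => Volatile.Read(ref _invocations);

    public T Create()
    {
        // Thread-safe increment, since a container may resolve from many threads.
        Interlocked.Increment(ref _invocations);
        return _inner();
    }
}
```

With a CountingFactory<CosmosClient> named counting (an illustrative name), the registration would become services.AddSingleton(_ => counting.Create()), and logging counting.Invocations some time after startup should show 1 if the container builds the client only once.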
I’m not sure how string duplicates would relate to connections; I think they are two different things. The first string in the list seems to be related to queries, so you are probably executing a high number of queries that are all similar. The data in it seems related to Query Plans. If the app is running as Release on Windows, compiled as x64, and the ServiceInterop.dll is present, the Query Plan should not be needed. The fact that it is there suggests that at least one of those conditions is not met and that your workload is running a high number of queries.
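The conditions listed in this comment can be checked from inside the running process. The following is a rough sketch using only base class library calls. The class name ServiceInteropCheck is hypothetical, the full file name Microsoft.Azure.Cosmos.ServiceInterop.dll is an assumption about how the native library appears in the app's output folder, and the Release check is a heuristic based on DebuggableAttribute, not something the SDK itself is known to inspect.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;
using System.Runtime.InteropServices;

// Hypothetical diagnostic helper for the conditions named in the comment above.
public static class ServiceInteropCheck
{
    // Assumption: file name of the native library next to the application binaries.
    public const string AssumedDllName = "Microsoft.Azure.Cosmos.ServiceInterop.dll";

    public static bool IsWindows =>
        RuntimeInformation.IsOSPlatform(OSPlatform.Windows);

    public static bool IsX64Process =>
        RuntimeInformation.ProcessArchitecture == Architecture.X64;

    public static bool DllPresent(string directory) =>
        File.Exists(Path.Combine(directory, AssumedDllName));

    // Heuristic: Debug builds normally carry a DebuggableAttribute
    // with the JIT optimizer disabled; Release builds normally do not.
    public static bool LooksLikeReleaseBuild(Assembly assembly)
    {
        var attribute = assembly.GetCustomAttribute<DebuggableAttribute>();
        return attribute == null || !attribute.IsJITOptimizerDisabled;
    }

    // One-line summary suitable for a startup log entry.
    public static string Describe(Assembly appAssembly, string baseDirectory) =>
        $"Windows={IsWindows}, x64={IsX64Process}, " +
        $"Release={LooksLikeReleaseBuild(appAssembly)}, " +
        $"ServiceInteropDll={DllPresent(baseDirectory)}";
}
```

Logging ServiceInteropCheck.Describe(Assembly.GetEntryAssembly()!, AppContext.BaseDirectory) once at startup would show which of the four conditions, if any, does not hold on the App Service instance.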