Unusual high number of connections to Cosmos DB instance

Hi 👋

Describe the bug

Last night we experienced a very strange issue with our Cosmos DB instance. It all started with a triggered alert that monitors App Service memory %.

image

When I checked the logs, I saw entries like these:

{
    "Timestamp": "2022-02-03T07:46:23.5098313+00:00",
    "Level": "Error",
    "MessageTemplate": "Redis connection error restored.",
    "Properties": {
        "SourceContext": "StackExchange.Redis.Extensions.Core.Implementations.RedisCacheConnectionPoolManager",
        "MachineName": "..."
    }
}
{"Timestamp":"2022-02-03T07:37:03.8894714+00:00","Level":"Error","MessageTemplate":"Could not store event in Redis","Exception":"StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation: SETEX events_belfasttrust_RI0nO_5GSp; UnableToConnect on xxx.redis.cache.windows.net:6380/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.2.4.27433, mc: 1/1/0, mgr: 10 of 10 available, clientName: PD1SDWK00003F, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=8,Free=32759,Min=1,Max=32767), v: 2.2.4.27433\r\n
 ---> StackExchange.Redis.RedisConnectionException: UnableToConnect on xxx.redis.cache.windows.net:6380/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.2.4.27433\r\n
   at StackExchange.Redis.TaskExtensions.TimeoutAfter(Task task, Int32 timeoutMs) in /_/src/StackExchange.Redis/TaskExtensions.cs:line 55\r\n
   at StackExchange.Redis.ConnectionMultiplexer.WaitAllIgnoreErrorsAsync(Task[] tasks, Int32 timeoutMilliseconds, LogProxy log, String caller, Int32 callerLineNumber) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 740\r\n  
   --- End of inner exception stack trace ---\r\n
   at StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2760\r\n--- End of stack trace from previous location ---\r\n
   at xxx.RedisDomainEventsCache.AddAsync(IExternalDomainEvent message) in C:\\agent\\_work\\2\\s\\src\\App\\Infrastructure\\Signalr\\RedisDomainEventsCache.cs:line 45","Properties":{"SourceContext":"xxx.Signalr.RedisDomainEventsCache","MachineName":"xx"}}

Nothing about CosmosDB yet.

That led me to check connection utilization on the App Service instance, and it was sky-high as well.

image

This is very unusual for us, because we don’t get much traffic at night.

image

In the morning, around 8 AM (UTC), I grabbed a memory dump from the App Service and opened it in dotMemory.

image

That prompted me to look at the application map in Application Insights, where I noticed we had made 2.2M calls to Cosmos DB over the last 12 hours. That is highly unusual: we only store tenant information there, and it is requested around 66k times per day.

image

image

image

This is what it looks like on a regular basis:

image

What is odd about this is that we only keep data in a single region, UK South.

image

This is how we construct our CosmosClient singleton instance:

services.AddSingleton(_ => new CosmosClient(
    cs,
    new CosmosClientOptions
    {
        ApplicationRegion = string.IsNullOrEmpty(region) ? Regions.UKSouth : region,
    }));

And this is how we use it:

public class AzureCosmosDbTenantStore : ITenantStore, IAsyncDisposable
{
    private static Container? _container = null;
    private static object _lock = new object();
    private static ConcurrentDictionary<string, TenantDetails>? _tenants;
    private readonly CosmosClient _cosmosClient;
    private readonly string _database;
    private readonly string _containerName;
    private readonly SecretClient _secretClient;
    private readonly IOptions<MultiTenantOptions> _multiTenantOptions;
    private readonly ITenantDbSettings _tenantDbSettings;
    private readonly ILogger<AzureCosmosDbTenantStore> _logger;
    private ChangeFeedProcessor? _processor;
    private bool _disposed;

    public AzureCosmosDbTenantStore(
        CosmosClient cosmosClient,
        string database,
        string containerName,
        SecretClient secretClient,
        IOptions<MultiTenantOptions> multiTenantOptions,
        ITenantDbSettings tenantDbSettings,
        ILogger<AzureCosmosDbTenantStore> logger)
    {
        _cosmosClient = cosmosClient;
        _database = database;
        _containerName = containerName;
        _secretClient = secretClient;
        _multiTenantOptions = multiTenantOptions;
        _tenantDbSettings = tenantDbSettings;
        _logger = logger;
    }

    public async Task<Tenant> GetTenantAsync(string identifier, CancellationToken cancellationToken = default)
    {
        var tenants = await GetAsync(cancellationToken);
        var tenant = tenants.FirstOrDefault(x => x.Identifier == identifier) ??
                             throw new Exception($"Tenant with identifier {identifier} not found.");
        if (_tenants != null && !_tenants[tenant.Identifier].SecretsLoaded)
        {
            await ReadTenantSecretsAsync(_tenants[tenant.Identifier]);
        }

        return tenant;
    }

    public async Task<Tenant[]> GetAsync(CancellationToken cancellationToken = default)
    {
        if (_tenants != null)
        {
            return _tenants.Select(x => x.Value.Tenant).ToArray();
        }

        var container = await GetContainer(cancellationToken);
        var queryable = container.GetItemLinqQueryable<CosmosTenantInfo>();
        var iterator = queryable.ToFeedIterator();
        var read = await iterator.ReadNextAsync(cancellationToken);
        _tenants = new ConcurrentDictionary<string, TenantDetails>(read.Resource
            .Select(Tenant)
            .Where(x => x != null)
            .OfType<TenantDetails>()
            .ToDictionary(x => x.Tenant.Identifier, x => x));

        foreach (var (_, details) in _tenants)
        {
            if (!details.SecretsLoaded)
            {
                await ReadTenantSecretsAsync(details);
            }
        }

        return _tenants.Select(x => x.Value.Tenant).ToArray();
    }

    public async ValueTask DisposeAsync()
    {
        await Dispose(true);
        GC.SuppressFinalize(this);
    }

    private async Task<Tenant> ReadTenantSecretsAsync(TenantDetails tenantDetails)
    {
        var tenant = tenantDetails.Tenant;
        try
        {
            var connectionString = await GetSecretValueAsync(tenant, _secretClient, "ConnectionString");
            var jwtKey = await GetSecretValueAsync(tenant, _secretClient, "Jwt-Key");

            tenant.Configuration
                .WithConnectionString(_tenantDbSettings, connectionString, _multiTenantOptions.Value.DbPattern)
                .WithAuthSettings(new AuthSettings(jwtKey));

            ..

            tenantDetails.SecretsRead();
            tenant.Initialized();
        }
        catch (Exception e)
        {
            _logger.LogError(e, "Could not read tenant configuration for tenant with id '{TenantId}'", tenant.Id);
        }

        return tenant;
    }

    private async Task<string> GetSecretValueAsync(Tenant tenant, SecretClient client, string keyName)
    {
        try
        {
            var response = await client.GetSecretAsync(Key(tenant, keyName));
            return response.Value.Value;
        }
        catch (Exception e)
        {
            _logger.LogError(e, "Issue retrieving secret from key vault");
            return string.Empty;
        }
    }

    private string Key(Tenant tenant, string keyName)
        => $"{tenant.Identifier}-{keyName}".ToLower();

    private TenantDetails? Tenant(CosmosTenantInfo x)
    {
        ..
    }

    private async ValueTask Dispose(bool disposing)
    {
        if (_disposed)
        {
            return;
        }

        if (disposing)
        {
            if (_processor != null)
            {
                await _processor.StopAsync();
            }
        }

        _disposed = true;
    }

    private async Task<Container> GetContainer(CancellationToken cancellationToken)
    {
        if (_container != null)
        {
            return _container;
        }

        var properties = new ContainerProperties(_containerName, "/partition");
        var databaseResponse =
                await _cosmosClient.CreateDatabaseIfNotExistsAsync(_database, cancellationToken: cancellationToken);
        var database = databaseResponse.Database;
        var containerResponse =
                await database.CreateContainerIfNotExistsAsync(properties, cancellationToken: cancellationToken);
        var leaseContainer =
            await database.CreateContainerIfNotExistsAsync(new ContainerProperties($"{_containerName}_lease", "/id"), cancellationToken: cancellationToken);
        lock (_lock)
        {
            var container = containerResponse.Container;
            _container = container;

            if (_processor == null)
            {
                var builder = _container.GetChangeFeedProcessorBuilder(
                    "watch",
                    (IReadOnlyCollection<CosmosTenantInfo> input, CancellationToken _) =>
                    {
                        foreach (var change in input)
                        {
                            var item = _tenants?.FirstOrDefault(x => x.Key == change.Id);
                            if (item == null || string.IsNullOrEmpty(item.Value.Key))
                            {
                                var tenantDetails = Tenant(change);
                                if (tenantDetails != null)
                                {
                                    _tenants?.TryAdd(change.Id, tenantDetails);
                                }
                            }
                            else
                            {
                                if (_tenants != null)
                                {
                                    _tenants.TryGetValue(change.Id, out var found);
                                    if (found != null)
                                    {
                                        var tenantDetails = Tenant(change);
                                        if (tenantDetails != null)
                                        {
                                            _tenants[change.Id] = tenantDetails;
                                        }
                                    }
                                }
                            }
                        }

                        return Task.CompletedTask;
                    });

                _processor = builder
                    .WithInstanceName("changefeed")
                    .WithLeaseContainer(leaseContainer)
                    .Build();
            }
        }

        if (_processor != null)
        {
            await _processor.StartAsync();
        }

        return _container;
    }

    public class TenantDetails
    {
        public TenantDetails(Tenant tenant)
        {
            Tenant = tenant;
            SecretsLoaded = false;
        }

        public Tenant Tenant { get; }
        public bool SecretsLoaded { get; private set; }

        public TenantDetails SecretsRead()
        {
            SecretsLoaded = true;
            return this;
        }
    }

    public class CosmosTenantInfo
    {
        [JsonProperty(PropertyName = "id")]
        public string Id { get; set; } = string.Empty;
        [JsonProperty(PropertyName = "partition")]
        public string Partition { get; set; } = "tenants";

        public string Version { get; set; } = string.Empty;
        public string Identifier { get; set; } = string.Empty;
        public AmazonQuicksightInfo? AmazonQuicksightInfo { get; set; }
    }

}

Unfortunately, there was nothing in the logs mentioning any issues with Cosmos.

I’m looking for help understanding why Application Insights showed such a large number of calls to Cosmos DB, as it may be related to our memory issue. Thanks for your help.

To Reproduce

Unfortunately, there is no reproduction I can give you: it happened for the first time, and after an App Service restart everything is back to normal.

Expected behavior

Actual behavior

Environment summary

SDK Version: 3.23.0
OS Version (e.g. Windows, Linux, MacOSX): Windows

Additional context

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:5 (3 by maintainers)

Top GitHub Comments

1 reaction
robertlyson commented, Feb 3, 2022

@j82w thanks for your comment.

This code is not draining the query results completely.

Thanks for pointing this out 👍
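
For readers following along: the remark about draining refers to GetAsync in the issue, which calls ReadNextAsync once and therefore only sees the first page of results. Below is a minimal sketch of reading every page. The extension method name ReadAllAsync is hypothetical (it is not part of the SDK); HasMoreResults and ReadNextAsync are the SDK's own FeedIterator<T> members.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class FeedIteratorExtensions
{
    // Hypothetical helper: keeps reading until the iterator reports no more pages,
    // instead of stopping after the first ReadNextAsync call.
    public static async Task<List<T>> ReadAllAsync<T>(
        this FeedIterator<T> iterator,
        CancellationToken cancellationToken = default)
    {
        var items = new List<T>();
        while (iterator.HasMoreResults)
        {
            // Each call returns one page; FeedResponse<T> is enumerable.
            FeedResponse<T> page = await iterator.ReadNextAsync(cancellationToken);
            items.AddRange(page);
        }

        return items;
    }
}
```

In GetAsync this would replace the single ReadNextAsync call, for example `var all = await queryable.ToFeedIterator().ReadAllAsync(cancellationToken);`. Whether this is related to the connection spike is not established in the thread.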

  1. Try using ApplicationPreferredRegions instead of ApplicationRegion since the account is not geo-replicated.

Will do.
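
A sketch of what that change might look like, based on the registration shown in the issue. The factory class and method names are hypothetical; ApplicationPreferredRegions is the SDK option that takes an ordered list of regions. As far as I know the two options are mutually exclusive, so ApplicationRegion is dropped here rather than kept alongside it.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

public static class CosmosClientFactory
{
    // Hypothetical factory mirroring the issue's registration, but using
    // ApplicationPreferredRegions instead of ApplicationRegion.
    public static CosmosClient Create(string connectionString, string? region)
    {
        var options = new CosmosClientOptions
        {
            // Single-entry list: the account in the issue lives only in UK South.
            ApplicationPreferredRegions = new List<string>
            {
                string.IsNullOrEmpty(region) ? Regions.UKSouth : region,
            },
        };

        return new CosmosClient(connectionString, options);
    }
}
```

It could then be registered as `services.AddSingleton(_ => CosmosClientFactory.Create(cs, region));`.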

  1. Can you add a trace or log to verify the CosmosClient is only created once?

Yes, I always get the same hash code from _cosmosClient.GetHashCode(), and the implementation factory in services.AddSingleton(_ => new CosmosClient(..)) is called once. I don’t construct CosmosClient manually; it always goes through the dependency container.
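
One way to add such a trace is to count factory invocations and log each one. This is only a sketch: the extension method name AddTracedCosmosClient and the logger category are made up for illustration, and it assumes logging has been registered (for example with services.AddLogging()).

```csharp
using System.Threading;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

public static class CosmosRegistration
{
    // Process-wide count of how many times the factory has run.
    private static int _created;

    // Hypothetical extension method: registers CosmosClient as a singleton
    // and logs every time the factory is actually invoked.
    public static IServiceCollection AddTracedCosmosClient(
        this IServiceCollection services,
        string connectionString)
    {
        return services.AddSingleton(sp =>
        {
            int count = Interlocked.Increment(ref _created);
            ILogger logger = sp.GetRequiredService<ILoggerFactory>()
                .CreateLogger("CosmosRegistration");

            if (count > 1)
            {
                // More than one creation means the client is not a true singleton.
                logger.LogWarning("CosmosClient created {Count} times", count);
            }
            else
            {
                logger.LogInformation("CosmosClient created (instance #{Count})", count);
            }

            return new CosmosClient(connectionString);
        });
    }
}
```

A warning in the logs would indicate that more than one client was created in the process, for instance because more than one service provider was built.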

  1. Can you provide at least 1 diagnostic string from any of the reads?

Will grab one from production.
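
For reference, a diagnostic string can be taken from the Diagnostics property on any response. The sketch below logs it for a query page; the helper name ReadNextWithDiagnosticsAsync is hypothetical, while Diagnostics, GetClientElapsedTime, RequestCharge and Count are SDK members.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Logging;

public static class CosmosDiagnosticsLogging
{
    // Hypothetical helper: reads one page and logs its diagnostics string.
    public static async Task<FeedResponse<T>> ReadNextWithDiagnosticsAsync<T>(
        this FeedIterator<T> iterator,
        ILogger logger,
        CancellationToken cancellationToken = default)
    {
        FeedResponse<T> page = await iterator.ReadNextAsync(cancellationToken);

        // Diagnostics.ToString() produces the text that maintainers ask for in issues.
        logger.LogInformation(
            "Cosmos read: {Count} items, {RequestCharge} RU, {ElapsedMs} ms. Diagnostics: {Diagnostics}",
            page.Count,
            page.RequestCharge,
            page.Diagnostics.GetClientElapsedTime().TotalMilliseconds,
            page.Diagnostics.ToString());

        return page;
    }
}
```

The diagnostics string can be large, so in production it may be preferable to log it only for slow or failed requests.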

0 reactions
ealsur commented, Feb 4, 2022

I'm not sure how string duplicates would map to connections; I think they are two different things. The first string in the list seems to be related to queries, so it probably shows up if you are executing a high number of similar queries. The data in it looks related to query plans. If the app runs as Release on Windows, is compiled as x64, and ServiceInterop.dll is present, the query plan should not be needed. The fact that it is there suggests that one of those conditions is not met and that your workload is running a high number of queries.
