Memory leak when using MongoDb integration

Describe the bug
When the MongoDb integrations are present in the dotnet-integrations.json, our memory usage slowly increases at an average rate of roughly 5 MB per hour. This continues until our tasks in ECS/Fargate run out of memory and the containers are killed and restarted.

To Reproduce
Steps to reproduce the behavior:

  1. Create a task in AWS ECS/Fargate with a .NET 5 service auto-instrumented with dd-trace-dotnet and standard dotnet-integrations.json
  2. Hook up the periodic health check to run the Ping command on the Mongo server (a minimal sketch of such a check follows this list)
  3. Let the service sit idle for many days
  4. See the memory usage of the task increase over time from within CloudWatch (or other monitoring tool)
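
Step 2 refers to a periodic health check that pings the Mongo server. The reporter’s actual check isn’t shown in the issue, so the following is only a minimal sketch of what such a check might look like with ASP.NET Core health checks and the Mongo C# driver; the class name MongoPingHealthCheck and the use of the admin database are illustrative assumptions.

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using MongoDB.Bson;
using MongoDB.Driver;

public class MongoPingHealthCheck : IHealthCheck
{
    private readonly IMongoClient _client;

    public MongoPingHealthCheck(IMongoClient client) => _client = client;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Each ping is a server command, so it produces a MongoDb span
            // whenever the MongoDb integration is enabled.
            await _client.GetDatabase("admin").RunCommandAsync<BsonDocument>(
                new BsonDocument("ping", 1), cancellationToken: cancellationToken);
            return HealthCheckResult.Healthy();
        }
        catch (MongoException ex)
        {
            return HealthCheckResult.Unhealthy("Mongo ping failed", ex);
        }
    }
}

Registered with something like services.AddHealthChecks().AddCheck<MongoPingHealthCheck>("mongo"), the check runs (and produces spans) on every health-probe request, even while the service is otherwise idle.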

Screenshots
[Screenshot: DdTraceDotnet-MongoMemoryLeak-1, memory usage over time, annotated with the points A–F referenced below]

  • Start to A = services sitting idle with version 1.26.1 and the full contents of datadog-integration.json
  • A = ECS/Fargate killed and restarted the tasks because they ran out of memory
  • A to B = no change; services sitting idle and eating memory after the restart
  • B to C = irrelevant tests
  • C = manual restart with a completely empty datadog-integration.json (its only content is an empty JSON array [])
  • C to D = sitting idle for a day with very little increase in memory
  • D = manual restart with all contents of datadog-integrations.json restored, but with the MongoDb sections removed
  • D’ = enabled a Hangfire job that runs periodically to execute MongoDb queries
  • D to E = mostly sitting idle with very little increase in memory
  • E = manual restart with all contents of datadog-integrations.json restored (including the MongoDb sections)
  • E to F = sitting idle with memory increasing
  • F = manual restart after a deploy that upgraded to 1.28.0 and updated datadog-integrations.json to that tag (includes the MongoDb sections)
  • F to now = sitting idle; memory increase appears to be continuing with the latest version

Runtime environment (please complete the following information):

  • Instrumentation mode: Automatic with Debian APM installed in container
  • Tracer version: 1.26.1 and 1.28.0
  • OS: Container based on image mcr.microsoft.com/dotnet/aspnet:5.0 which is a Debian Buster-Slim distro
  • CLR: .NET 5

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

2 reactions
andrewlock commented, Aug 23, 2021

Thanks @hiiru for providing all the extra details; this is really useful.

The source of the problem

We think we’ve tracked down the source of the issue; unfortunately, there appear to be multiple facets to it.

  1. The MongoClient (incorrectly) flows the execution context from the current thread when it is being created. That means that if there’s an active trace at the time the MongoClient is created, the background thread that checks the status of the cluster inherits that active trace. Even if the scope is closed on another thread, the background thread will keep adding spans to the trace. (A standalone sketch of this execution-context flow follows the list.)
  2. If you register your MongoClient as a singleton service in ASP.NET Core’s DI container, it is “lazily” created when it is first needed. That’s very application-dependent, but it’s likely to be inside a request trace or something similar, which means it’s likely to capture a trace.
  3. If the parent trace is “long-lived”, as opposed to a typical “short-lived” request trace, then the trace will not be flushed, and the background thread will continue adding spans to it, causing the memory leak.
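
To make facet 1 concrete, here is a minimal, self-contained sketch (an illustration only; it uses neither Datadog nor Mongo code). State carried in the execution context, such as an ambient value in an AsyncLocal<T>, flows into background work started while it is set, unless that flow is explicitly suppressed, which is exactly what solution 2 below relies on.

using System;
using System.Threading;

class ExecutionContextFlowDemo
{
    // Stand-in for an "active trace scope" carried in the execution context.
    private static readonly AsyncLocal<string> ActiveTrace = new AsyncLocal<string>();

    static void Main()
    {
        ActiveTrace.Value = "request-trace-123";

        // A thread started now captures the current execution context,
        // so it sees the ambient value ("request-trace-123").
        var inheriting = new Thread(() =>
            Console.WriteLine("inheriting thread sees: " + (ActiveTrace.Value ?? "<none>")));
        inheriting.Start();
        inheriting.Join();

        // With flow suppressed, a thread started here gets the default context
        // and sees no ambient value ("<none>").
        using (ExecutionContext.SuppressFlow())
        {
            var suppressed = new Thread(() =>
                Console.WriteLine("suppressed thread sees: " + (ActiveTrace.Value ?? "<none>")));
            suppressed.Start();
            suppressed.Join();
        }
    }
}

The MongoClient’s cluster-monitoring thread behaves like the first case: it is started while a trace scope is ambient, so it keeps that scope for its lifetime.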

Solutions

Depending on the exact behaviour you’re seeing and the nature of the problem, there are various ways to mitigate or fix it.

1. Stop flowing execution context in MongoClient

This is the real solution to the problem: it would involve an update to the Mongo C# driver to prevent it from flowing the execution context. I’d suggest filing a support case with MongoDB; as a customer, hopefully this will be prioritised!

2. Create the MongoClient ahead of time

Depending on your application, you may find that you can solve the problem by registering your singleton MongoClient instance in DI directly. For example, if you were previously doing this:

public void ConfigureServices(IServiceCollection services)
{
    // The factory delegate runs lazily, on the first resolution of IMongoClient,
    // which may well happen inside an active request trace.
    services.AddSingleton<IMongoClient>(_ => new MongoClient(Configuration.GetConnectionString("mongo")));
}

Do this instead:

public void ConfigureServices(IServiceCollection services)
{
    // The client is created eagerly here, while Startup is being built and
    // (normally) no trace scope is active.
    var mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    services.AddSingleton<IMongoClient>(mongoClient);
}

This ensures that the context is captured while Startup is being built, instead of when the client is first used. As long as you aren’t creating long-lived scopes that encompass this part of the app lifecycle (which we would not recommend anyway), this may solve your issue.

More generally, if you’re not using this DI pattern, ensure that you’re not creating the MongoClient inside a “long-lived” trace/scope. If you are creating the client inside an existing scope, you can ensure you don’t capture the context by calling ExecutionContext.SuppressFlow(), for example:

public void ConfigureServices(IServiceCollection services)
{
    MongoClient mongoClient;
    // Suppress execution-context flow so the client's background monitoring
    // thread does not inherit any ambient trace scope.
    using (System.Threading.ExecutionContext.SuppressFlow())
    {
        mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    }
    services.AddSingleton<IMongoClient>(mongoClient);
}

3. Enable partial flush

We generally advise against creating long-lived traces. By design, traces remain in memory until the last span is closed, so if you have a long-lived trace with many child spans, these will use an increasing amount of memory as the app lifetime increases.

In some cases, you can mitigate this issue by enabling partial flush for the .NET Tracer, as described in the documentation. If the above solution isn’t possible or doesn’t resolve the issue, it may be worth trying DD_TRACE_PARTIAL_FLUSH_ENABLED=true.
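
In this issue’s setup (a container image based on mcr.microsoft.com/dotnet/aspnet:5.0, deployed to ECS/Fargate), that setting could be applied as an environment variable in the image or in the ECS task definition. A minimal Dockerfile sketch; only the base image and the variable itself come from this thread, everything else about the image is assumed:

# Assumed Dockerfile fragment; the rest of the image setup is omitted.
FROM mcr.microsoft.com/dotnet/aspnet:5.0

# Flush completed spans of long-lived traces instead of holding the whole
# trace in memory until the root span closes.
ENV DD_TRACE_PARTIAL_FLUSH_ENABLED=true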

Other possible enhancements

There seems to be little utility in seeing these infrastructural isMaster spans in Datadog, as they mostly clutter up traces (completely aside from the memory issue described). Would you be interested in an option in the tracer that excludes these spans (and other “infrastructural” commands) from automatic instrumentation by default, while allowing you to enable them with an environment variable?

1 reaction
hiiru commented, Aug 26, 2021

After 2 days with the SuppressFlow fix, memory usage still looks good. This definitely fixed the issue. 👍
[Screenshot: memory usage graph after the fix]

@andrewlock Thank you for the solution and your MongoDb PR to fix it.
