Hangfire.Mongo service eventually creates too many threads and hangs
Hello,
I’ve run into problems with a service that uses Hangfire to schedule a recurring job. Versions: Hangfire.Mongo v0.2.5, Hangfire.Core v1.5.3.
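For reference, the scheduling setup looks roughly like this (the connection string, database name and the job itself are placeholders, and the exact configuration API for this Hangfire.Mongo version may differ slightly):

using Hangfire;
using Hangfire.Mongo;

public class SchedulerService
{
    private BackgroundJobServer _server;

    public void Start()
    {
        // Hangfire.Mongo exposes UseMongoStorage on the global configuration.
        GlobalConfiguration.Configuration
            .UseMongoStorage("mongodb://localhost:27017", "hangfire");

        // One server instance hosting the schedulers and workers.
        _server = new BackgroundJobServer();

        // The single recurring job the service schedules (placeholder name/cron).
        RecurringJob.AddOrUpdate("sync-job", () => SyncJob.Run(), Cron.Hourly());
    }

    public void Stop()
    {
        _server.Dispose();
    }
}

public static class SyncJob
{
    public static void Run() { /* domain work elided */ }
}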
The service appears to run fine for roughly 7-10 days, using about 50-70 threads and 75-150 MB of memory.
Then the thread count starts creeping up at precisely one thread per second (I have verified this twice now using resmon). After the service reaches 30,000+ threads it freezes and is unrecoverable, as you might expect. By that point the service is using 1.3 GB of memory, nearly all of it thread overhead.
I have been watching this issue in production for about a month and have generated memory dumps from two different instances of the process in this state. Both times, nearly 100% of the threads created by the service came from Hangfire.Mongo. The threads share a common stack frame:
Hangfire.Mongo.dll!Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__7+0xa3
This is accompanied by the following log statements:
Message: Error occurred during execution of 'DelayedJobScheduler' process. Execution will be retried (attempt 9 of 2147483647) in 00:01:19 seconds.
Exception: Hangfire.Mongo.DistributedLock.MongoDistributedLockException: Could not place a lock on the resource 'HangFire:locks:schedulepoller': The lock request timed out.
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock..ctor(String resource, TimeSpan timeout, HangfireDbContext database, MongoStorageOptions options)
   at Hangfire.Mongo.MongoConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.DelayedJobScheduler.EnqueueNextScheduledJob(BackgroundProcessContext context)
   at Hangfire.Server.DelayedJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
Message: Error occurred during execution of 'RecurringJobScheduler' process. Execution will be retried (attempt 1 of 2147483647) in 00:00:01 seconds.
Exception: Hangfire.Mongo.DistributedLock.MongoDistributedLockException: Could not place a lock on the resource 'HangFire:recurring-jobs:lock': The lock request timed out.
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock..ctor(String resource, TimeSpan timeout, HangfireDbContext database, MongoStorageOptions options)
   at Hangfire.Mongo.MongoConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
These statements begin appearing 2-3 days before the service goes offline permanently, increasing in frequency until it freezes.
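In case it helps with diagnosis, this is the kind of query I use to inspect the lock documents while the errors are occurring. Note that the collection name ("hangfire.locks") and the "Resource" field are my assumptions based on the default Hangfire.Mongo collection prefix, so adjust them to whatever your storage options use:

using System;
using MongoDB.Bson;
using MongoDB.Driver;

public static class LockInspector
{
    // Dumps any lock documents matching the resources named in the exceptions above.
    public static void DumpSchedulerLocks(string connectionString, string databaseName)
    {
        var database = new MongoClient(connectionString).GetDatabase(databaseName);

        // Collection and field names assumed from the default Hangfire.Mongo prefix.
        var locks = database.GetCollection<BsonDocument>("hangfire.locks");
        var filter = Builders<BsonDocument>.Filter.Regex(
            "Resource", new BsonRegularExpression("schedulepoller|recurring-jobs"));

        foreach (var doc in locks.Find(filter).ToList())
        {
            Console.WriteLine(doc.ToJson());
        }
    }
}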
I’ve attached a memory dump analysis as generated by the windows debug diagnostic tool.
Memory_Report__Date_04_01_2016__Time_03_54_35PM__44.zip
Here’s a stack trace of an example thread from the debug dump (I checked a few dozen threads; they were all similar):
ntdll.dll!ZwWaitForMultipleObjects+0xa
KERNELBASE.dll!WaitForMultipleObjectsEx+0xed
clr.dll!CreateApplicationContext+0xd1da
clr.dll!CreateApplicationContext+0xcfde
clr.dll!CreateApplicationContext+0xcdf5
clr.dll!CreateApplicationContext+0xd0a1
clr.dll!DllGetClassObjectInternal+0x7847
clr.dll!DllGetClassObjectInternal+0x7815
clr.dll!DllGetClassObjectInternal+0x75d5
[Managed to Unmanaged Transition]
mscorlib.dll!System.Threading.ManualResetEventSlim.Wait+0x3ec
mscorlib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait+0xdb
mscorlib.dll!System.Threading.Tasks.Task.InternalWait+0x24a
mscorlib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification+0x6b
Hangfire.Mongo.dll!Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__7+0xa3
mscorlib.dll!System.Threading.ExecutionContext.RunInternal+0x285
mscorlib.dll!System.Threading.ExecutionContext.Run+0x9
mscorlib.dll!System.Threading.TimerQueueTimer.CallCallback+0x172
mscorlib.dll!System.Threading.TimerQueueTimer.Fire+0x10e
mscorlib.dll!System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem+0x43
mscorlib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch+0x1ea
[Unmanaged to Managed Transition]
clr.dll+0xa7f3
clr.dll+0xa6de
clr.dll+0xae76
clr.dll!GetMetaDataInternalInterface+0x31d01
clr.dll+0xc121
clr.dll+0xc0a8
clr.dll+0xc019
clr.dll+0xc15f
clr.dll!GetMetaDataInternalInterface+0x31c8e
clr.dll!GetMetaDataInternalInterface+0x30b26
clr.dll!GetMetaDataInternalInterface+0x30a1a
clr.dll!CopyPDBs+0x44a2
KERNEL32.dll!BaseThreadInitThunk+0x22
ntdll.dll!RtlUserThreadStart+0x34
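My reading of that trace: the heartbeat timer callback blocks synchronously on a Task (the ManualResetEventSlim.Wait / Task.InternalWait frames), so every timer tick parks a thread-pool thread, and once all of the pool's threads are blocked the runtime injects new ones at roughly one per second, which matches the growth rate I measured in resmon. Here is a standalone sketch of that general pattern; it is not Hangfire.Mongo's actual code, just the shape I suspect:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class HeartbeatLeakDemo
{
    // Kept in a field so the timer is not garbage collected.
    private static Timer _timer;

    public static void Main()
    {
        // A task that never completes stands in for a heartbeat call that
        // can no longer reach the server or acquire the lock.
        var neverCompletes = new TaskCompletionSource<bool>().Task;

        // Timer callbacks run on thread-pool threads; blocking inside the
        // callback pins one thread per tick, and the ticks keep firing.
        _timer = new Timer(_ => neverCompletes.Wait(), null,
                           TimeSpan.Zero, TimeSpan.FromSeconds(1));

        while (true)
        {
            Thread.Sleep(5000);
            Console.WriteLine(
                "Threads: " + Process.GetCurrentProcess().Threads.Count);
        }
    }
}

If the heartbeat issued its renewal as a fire-and-forget async call with its own timeout instead of waiting on the task inside the callback, the threads would not pile up; I have not read enough of the Hangfire.Mongo source to confirm that this is exactly what is happening.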
Finally, I found this on the discussion thread for the Postgres storage provider: link. It looks like the problems may be similar?
Thanks!
Comments from the thread (12 total, 4 by maintainers):
We have not solved the problem. It has reproduced for us many times, and we eventually had to revert to Hangfire.Redis to work around this issue.
No, we haven’t; in fact, our production servers just crashed this morning (10k+ threads, non-responsive) as a result of this issue. I’m in the process of reverting to Redis as the backing store.