Hangfire.Mongo service eventually creates too many threads and hangs
Hello,
I’ve run into problems with a service that uses Hangfire to schedule a recurring job. Versions: Hangfire.Mongo v0.2.5, Hangfire.Core v1.5.3.
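For reference, the scheduling setup looks roughly like this (the connection string, database name and the job itself are placeholders, and the exact configuration API for this Hangfire.Mongo version may differ slightly):

using Hangfire;
using Hangfire.Mongo;

public class SchedulerService
{
    private BackgroundJobServer _server;

    public void Start()
    {
        // Hangfire.Mongo exposes UseMongoStorage on the global configuration.
        GlobalConfiguration.Configuration
            .UseMongoStorage("mongodb://localhost:27017", "hangfire");

        // One server instance hosting the schedulers and workers.
        _server = new BackgroundJobServer();

        // The single recurring job the service schedules (placeholder name/cron).
        RecurringJob.AddOrUpdate("sync-job", () => SyncJob.Run(), Cron.Hourly());
    }

    public void Stop()
    {
        _server.Dispose();
    }
}

public static class SyncJob
{
    public static void Run() { /* domain work elided */ }
}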
The service appears to run fine for roughly 7-10 days, using about 50-70 threads and 75-150 MB of memory.
Then the thread count starts creeping up at precisely one thread per second (I have verified this twice now using resmon). After the service reaches 30,000+ threads it freezes and is unrecoverable, as you might expect. By that point the service is using 1.3 GB of memory, nearly all of it thread overhead.
I have been watching this issue in production for about a month and have generated memory dumps from two different instances of the process in this state. Both times, nearly 100% of the threads created by the service came from Hangfire.Mongo. The threads share a common stack frame:
Hangfire.Mongo.dll!Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__7+0xa3
This is accompanied by the following log statements:
Message: Error occurred during execution of 'DelayedJobScheduler' process. Execution will be retried (attempt 9 of 2147483647) in 00:01:19 seconds.
Exception: Hangfire.Mongo.DistributedLock.MongoDistributedLockException: Could not place a lock on the resource 'HangFire:locks:schedulepoller': The lock request timed out.
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock..ctor(String resource, TimeSpan timeout, HangfireDbContext database, MongoStorageOptions options)
   at Hangfire.Mongo.MongoConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.DelayedJobScheduler.EnqueueNextScheduledJob(BackgroundProcessContext context)
   at Hangfire.Server.DelayedJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
Message: Error occurred during execution of 'RecurringJobScheduler' process. Execution will be retried (attempt 1 of 2147483647) in 00:00:01 seconds.
Exception: Hangfire.Mongo.DistributedLock.MongoDistributedLockException: Could not place a lock on the resource 'HangFire:recurring-jobs:lock': The lock request timed out.
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock..ctor(String resource, TimeSpan timeout, HangfireDbContext database, MongoStorageOptions options)
   at Hangfire.Mongo.MongoConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
These statements begin appearing 2-3 days before the service goes offline permanently, increasing in frequency until it freezes.
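In case it helps with diagnosis, this is the kind of query I use to inspect the lock documents while the errors are occurring. Note that the collection name ("hangfire.locks") and the "Resource" field are my assumptions based on the default Hangfire.Mongo collection prefix, so adjust them to whatever your storage options use:

using System;
using MongoDB.Bson;
using MongoDB.Driver;

public static class LockInspector
{
    // Dumps any lock documents matching the resources named in the exceptions above.
    public static void DumpSchedulerLocks(string connectionString, string databaseName)
    {
        var database = new MongoClient(connectionString).GetDatabase(databaseName);

        // Collection and field names assumed from the default Hangfire.Mongo prefix.
        var locks = database.GetCollection<BsonDocument>("hangfire.locks");
        var filter = Builders<BsonDocument>.Filter.Regex(
            "Resource", new BsonRegularExpression("schedulepoller|recurring-jobs"));

        foreach (var doc in locks.Find(filter).ToList())
        {
            Console.WriteLine(doc.ToJson());
        }
    }
}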
I’ve attached a memory dump analysis as generated by the windows debug diagnostic tool.
Memory_Report__Date_04_01_2016__Time_03_54_35PM__44.zip
Here’s a stack trace of an example thread from the debug dump (I checked a few dozen threads; they were all similar):
ntdll.dll!ZwWaitForMultipleObjects+0xa
KERNELBASE.dll!WaitForMultipleObjectsEx+0xed
clr.dll!CreateApplicationContext+0xd1da
clr.dll!CreateApplicationContext+0xcfde
clr.dll!CreateApplicationContext+0xcdf5
clr.dll!CreateApplicationContext+0xd0a1
clr.dll!DllGetClassObjectInternal+0x7847
clr.dll!DllGetClassObjectInternal+0x7815
clr.dll!DllGetClassObjectInternal+0x75d5
[Managed to Unmanaged Transition]
mscorlib.dll!System.Threading.ManualResetEventSlim.Wait+0x3ec
mscorlib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait+0xdb
mscorlib.dll!System.Threading.Tasks.Task.InternalWait+0x24a
mscorlib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification+0x6b
Hangfire.Mongo.dll!Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__7+0xa3
mscorlib.dll!System.Threading.ExecutionContext.RunInternal+0x285
mscorlib.dll!System.Threading.ExecutionContext.Run+0x9
mscorlib.dll!System.Threading.TimerQueueTimer.CallCallback+0x172
mscorlib.dll!System.Threading.TimerQueueTimer.Fire+0x10e
mscorlib.dll!System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem+0x43
mscorlib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch+0x1ea
[Unmanaged to Managed Transition]
clr.dll+0xa7f3
clr.dll+0xa6de
clr.dll+0xae76
clr.dll!GetMetaDataInternalInterface+0x31d01
clr.dll+0xc121
clr.dll+0xc0a8
clr.dll+0xc019
clr.dll+0xc15f
clr.dll!GetMetaDataInternalInterface+0x31c8e
clr.dll!GetMetaDataInternalInterface+0x30b26
clr.dll!GetMetaDataInternalInterface+0x30a1a
clr.dll!CopyPDBs+0x44a2
KERNEL32.dll!BaseThreadInitThunk+0x22
ntdll.dll!RtlUserThreadStart+0x34
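My reading of that trace: the heartbeat timer callback blocks synchronously on a Task (the ManualResetEventSlim.Wait / Task.InternalWait frames), so every timer tick parks a thread-pool thread, and once all of the pool's threads are blocked the runtime injects new ones at roughly one per second, which matches the growth rate I measured in resmon. Here is a standalone sketch of that general pattern; it is not Hangfire.Mongo's actual code, just the shape I suspect:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class HeartbeatLeakDemo
{
    // Kept in a field so the timer is not garbage collected.
    private static Timer _timer;

    public static void Main()
    {
        // A task that never completes stands in for a heartbeat call that
        // can no longer reach the server or acquire the lock.
        var neverCompletes = new TaskCompletionSource<bool>().Task;

        // Timer callbacks run on thread-pool threads; blocking inside the
        // callback pins one thread per tick, and the ticks keep firing.
        _timer = new Timer(_ => neverCompletes.Wait(), null,
                           TimeSpan.Zero, TimeSpan.FromSeconds(1));

        while (true)
        {
            Thread.Sleep(5000);
            Console.WriteLine(
                "Threads: " + Process.GetCurrentProcess().Threads.Count);
        }
    }
}

If the heartbeat issued its renewal as a fire-and-forget async call with its own timeout instead of waiting on the task inside the callback, the threads would not pile up; I have not read enough of the Hangfire.Mongo source to confirm that this is exactly what is happening.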
Finally, I found this on the discussion thread for the Postgres storage provider: link. It looks like the problems may be similar?
Thanks!
Comments from the thread (12 total, 4 by maintainers):
We have not solved the problem. It has reproduced for us many times, and we eventually had to revert to Hangfire.Redis to work around this issue.
No, we haven’t; in fact, our production servers just crashed this morning (10k+ threads, non-responsive) as a result of this issue. I’m in the process of reverting to Redis as the backing store.