
Mongo.Hangfire service eventually creates too many threads and hangs

See original GitHub issue

Hello,

I’ve run into problems with a service that uses Hangfire to schedule a recurring job. Versions: Hangfire.Mongo v0.2.5, Hangfire.Core v1.5.3.
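For context, the setup presumably looks roughly like this (a minimal sketch only; the connection string, job name, and schedule below are assumptions, since the issue doesn't include the actual configuration, and the 0.2.x API may differ slightly):

    // Hypothetical configuration - the issue does not show the real setup.
    using System;
    using Hangfire;
    using Hangfire.Mongo;

    class Program
    {
        static void Main()
        {
            // Hangfire.Mongo registers MongoDB as the job storage backend.
            GlobalConfiguration.Configuration
                .UseMongoStorage("mongodb://localhost:27017", "hangfire");

            // A single recurring job on a cron schedule (name and schedule invented).
            RecurringJob.AddOrUpdate(
                "example-job",
                () => Console.WriteLine("Running scheduled work"),
                Cron.Hourly());

            // The server hosts the background processes (DelayedJobScheduler,
            // RecurringJobScheduler) that appear in the logs below.
            using (new BackgroundJobServer())
            {
                Console.ReadLine();
            }
        }
    }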

The service appears to run fine for roughly 7-10 days, using about 50-70 threads and 75-150 MB of memory.

Then the service's thread count starts creeping up, at precisely one thread per second (I have verified this twice now using resmon). After the service reaches 30,000+ threads, it freezes and is unrecoverable, as you might expect. By the time this happens, the service is using 1.3 GB of memory, nearly all of it thread overhead.

I have been watching this issue in production for about a month, and have generated two different memory dumps from two different instances of the process in this state. Both times, nearly 100% of the threads created by the service came from Hangfire.Mongo. The threads share similar characteristics:

Hangfire.Mongo.dll!Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__7+0xa3
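That frame is the compiler-generated lambda for MongoDistributedLock's heartbeat timer. As an aside, the one-thread-per-second growth rate is consistent with how the CLR thread pool behaves under starvation: when every pool thread is blocked, the pool injects roughly one new thread per second. A minimal sketch of the general pattern that produces this signature (illustrative only, not Hangfire.Mongo's actual code):

    // Illustrative only - NOT Hangfire.Mongo's implementation. A timer
    // callback that blocks on a Task ties up a thread-pool thread per tick;
    // once all pool threads are blocked, the CLR injects new threads at
    // roughly one per second, matching the growth observed in resmon.
    using System;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    class TimerLeakDemo
    {
        static Timer _timer; // static reference keeps the timer alive

        static void Main()
        {
            var never = new TaskCompletionSource<bool>();

            // Fires every second; every callback blocks forever on the task.
            _timer = new Timer(_ => never.Task.Wait(), null, 0, 1000);

            while (true)
            {
                // Thread count climbs by roughly one per second once the
                // pool's existing threads are all blocked.
                Console.WriteLine("Threads: " +
                    Process.GetCurrentProcess().Threads.Count);
                Thread.Sleep(5000);
            }
        }
    }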

This is accompanied by the following log statements:

Message: Error occurred during execution of 'DelayedJobScheduler' process. Execution will be retried (attempt 9 of 2147483647) in 00:01:19 seconds.

Exception: Hangfire.Mongo.DistributedLock.MongoDistributedLockException: Could not place a lock on the resource 'HangFire:locks:schedulepoller': The lock request timed out.
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock..ctor(String resource, TimeSpan timeout, HangfireDbContext database, MongoStorageOptions options)
   at Hangfire.Mongo.MongoConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.DelayedJobScheduler.EnqueueNextScheduledJob(BackgroundProcessContext context)
   at Hangfire.Server.DelayedJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

Message: Error occurred during execution of 'RecurringJobScheduler' process. Execution will be retried (attempt 1 of 2147483647) in 00:00:01 seconds.

Exception: Hangfire.Mongo.DistributedLock.MongoDistributedLockException: Could not place a lock on the resource 'HangFire:recurring-jobs:lock': The lock request timed out.
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock..ctor(String resource, TimeSpan timeout, HangfireDbContext database, MongoStorageOptions options)
   at Hangfire.Mongo.MongoConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
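Two details stand out in these logs. "attempt N of 2147483647" is int.MaxValue: Hangfire retries its server background processes effectively forever, so the errors repeat until the process freezes rather than failing fast. And "The lock request timed out" is what a polling lock acquisition produces when the lock is never freed. A generic sketch of that acquire loop (not Hangfire.Mongo's actual code; the tryInsertLockDocument helper is invented):

    // Generic sketch of a polling distributed-lock acquire with a timeout -
    // NOT Hangfire.Mongo's implementation. tryInsertLockDocument is a
    // hypothetical helper that atomically creates a lock document keyed by
    // the resource name, failing if a live one already exists.
    using System;
    using System.Threading;

    static class DistributedLockSketch
    {
        public static void Acquire(string resource, TimeSpan timeout,
                                   Func<string, bool> tryInsertLockDocument)
        {
            DateTime deadline = DateTime.UtcNow + timeout;
            while (DateTime.UtcNow < deadline)
            {
                if (tryInsertLockDocument(resource))
                    return; // lock acquired

                Thread.Sleep(100); // poll interval (invented value)
            }

            // If a stale lock is never cleaned up (e.g. because its owner's
            // heartbeat thread is wedged), every caller ends up here.
            throw new TimeoutException(
                "Could not place a lock on the resource '" + resource +
                "': The lock request timed out.");
        }
    }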

These statements appear 2-3 days before the service goes offline permanently, increasing in frequency until it freezes.

I’ve attached a memory dump analysis generated by the Windows Debug Diagnostic Tool.

Memory_Report__Date_04_01_2016__Time_03_54_35PM__44.zip

Here’s a stack trace of an example thread from the debug dump (I checked a few dozen threads; they were all similar):

ntdll.dll!ZwWaitForMultipleObjects+0xa
KERNELBASE.dll!WaitForMultipleObjectsEx+0xed
clr.dll!CreateApplicationContext+0xd1da
clr.dll!CreateApplicationContext+0xcfde
clr.dll!CreateApplicationContext+0xcdf5
clr.dll!CreateApplicationContext+0xd0a1
clr.dll!DllGetClassObjectInternal+0x7847
clr.dll!DllGetClassObjectInternal+0x7815
clr.dll!DllGetClassObjectInternal+0x75d5
[Managed to Unmanaged Transition]
mscorlib.dll!System.Threading.ManualResetEventSlim.Wait+0x3ec
mscorlib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait+0xdb
mscorlib.dll!System.Threading.Tasks.Task.InternalWait+0x24a
mscorlib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification+0x6b
Hangfire.Mongo.dll!Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__7+0xa3
mscorlib.dll!System.Threading.ExecutionContext.RunInternal+0x285
mscorlib.dll!System.Threading.ExecutionContext.Run+0x9
mscorlib.dll!System.Threading.TimerQueueTimer.CallCallback+0x172
mscorlib.dll!System.Threading.TimerQueueTimer.Fire+0x10e
mscorlib.dll!System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem+0x43
mscorlib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch+0x1ea
[Unmanaged to Managed Transition]
clr.dll+0xa7f3
clr.dll+0xa6de
clr.dll+0xae76
clr.dll!GetMetaDataInternalInterface+0x31d01
clr.dll+0xc121
clr.dll+0xc0a8
clr.dll+0xc019
clr.dll+0xc15f
clr.dll!GetMetaDataInternalInterface+0x31c8e
clr.dll!GetMetaDataInternalInterface+0x30b26
clr.dll!GetMetaDataInternalInterface+0x30a1a
clr.dll!CopyPDBs+0x44a2
KERNEL32.dll!BaseThreadInitThunk+0x22
ntdll.dll!RtlUserThreadStart+0x34
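For anyone hitting the same symptom, a cheap way to catch the leak before the freeze (an illustrative sketch, not something from the issue; the threshold is an assumption) is to watch the process's own thread count and alert well below the ~30,000 mark where it becomes unrecoverable:

    // Illustrative watchdog sketch: log/alert when the thread count drifts
    // past a threshold, long before the ~30,000-thread freeze.
    using System;
    using System.Diagnostics;
    using System.Threading;

    static class ThreadCountWatchdog
    {
        const int Threshold = 500; // well above the healthy 50-70 baseline
        static Timer _timer;       // static reference keeps the timer alive

        public static void Start()
        {
            _timer = new Timer(_ =>
            {
                int count = Process.GetCurrentProcess().Threads.Count;
                if (count > Threshold)
                {
                    // Replace with your alerting mechanism of choice.
                    Console.Error.WriteLine(
                        "Thread count " + count + " exceeds " + Threshold);
                }
            }, null, TimeSpan.Zero, TimeSpan.FromMinutes(1));
        }
    }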

Finally, I found this on the Postgres storage provider's discussion thread:

link. Looks like the problems may be similar?

Thanks!

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

1 reaction
Bonkles commented, Aug 9, 2016

We have not solved the problem. It has reproduced for us many times, and we eventually had to revert to Hangfire.Redis to work around this issue.

1 reaction
briangweber commented, Aug 9, 2016

No, we haven’t. In fact, our production servers crashed just this morning (10k+ threads, non-responsive) as a result of this issue. I’m in the process of reverting to Redis as a backing store.
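For reference, the workaround both commenters describe amounts to swapping the storage registration back to a Redis provider. Roughly (a sketch only; the package and connection string are assumptions, e.g. the community Hangfire.Redis.StackExchange package exposes UseRedisStorage):

    // Sketch of the workaround: replace Mongo storage with Redis storage.
    using Hangfire;

    static class StorageSwap
    {
        public static void Configure()
        {
            // Before (the configuration that exhibits the thread leak):
            // GlobalConfiguration.Configuration
            //     .UseMongoStorage("mongodb://localhost:27017", "hangfire");

            // After (reverting to a Redis backing store):
            GlobalConfiguration.Configuration
                .UseRedisStorage("localhost:6379");
        }
    }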
