Job stuck in the Processing state (InvisibilityTimeout not working properly)
During some tests in which our server is restarted while jobs are being executed, we found out that sometimes jobs become stuck in the Processing state and never recover.
After we start the server again, those jobs are never executed, no matter how long we wait (we waited for the invisibility timeout and nothing happened; we even left it for a few hours and still nothing happened).
The Hangfire UI seems to know that the job is stuck/aborted, as it shows the following message:
The job was aborted – it is processed by server testpc:server1:33704:7daebff8-39b2-4899-9f5f-3dae35a39863 which is not in the active servers list for now. It will be retried automatically after invisibility timeout, but you can also re-queue or delete it manually.
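For completeness, the manual re-queue the message refers to can also be done programmatically. This is a small hedged snippet, assuming Hangfire's static BackgroundJob.Requeue behaves like the dashboard's Requeue button; the job id is the one that appears in the logs further below.

using Hangfire;

// Programmatic equivalent of the dashboard's "Requeue" button mentioned in the
// message above (assumed equivalent; verify against your Hangfire version).
BackgroundJob.Requeue("61e992d0172392e01cb7be14");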
After digging into this issue a little more, I found a weird behaviour that I think is causing this, and it is related to InvisibilityTimeout.
In our example we have an InvisibilityTimeout of 2 minutes (it could be 30 minutes, it wouldn't make a difference). We have some long-running jobs that may exceed our InvisibilityTimeout (for our use case, we are OK with that).
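For context, this is roughly how the timeout is configured in our setup. This is a minimal sketch only: it assumes Hangfire.Mongo (as in the repro project linked below), that MongoStorageOptions exposes an InvisibilityTimeout property, and a hypothetical connection string and database name; overloads and option names may differ between versions.

using System;
using Hangfire;
using Hangfire.Mongo;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    services.AddHangfire(config => config.UseMongoStorage(
        "mongodb://localhost:27017",   // hypothetical connection string
        "hangfire-repro",              // hypothetical database name
        new MongoStorageOptions
        {
            // Jobs whose execution exceeds this timeout are considered abandoned
            // and become visible to other workers again.
            InvisibilityTimeout = TimeSpan.FromMinutes(2)
        }));

    services.AddHangfireServer();
}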
When the InvisibilityTimeout is exceeded, the job's CancellationToken is triggered and the worker that is running that job aborts the execution. At the same time, a new worker picks up the job and runs it again. This is the expected behaviour (AFAIK).
The problem is that this only happens once: the worker that is now processing the long-running job will no longer receive any kind of cancellation when the InvisibilityTimeout is exceeded. The same happens if the server is restarted at this point: the job is never processed again and stays stuck in “Processing”.
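To make the scenario concrete, the long-running job in our repro looks roughly like this. It is a hedged sketch reconstructed from the logs further below: the method name ExecuteLongTaskAsync and the log format come from the repro, while the class name, the use of PerformContext, and the plain CancellationToken parameter (supported by Hangfire 1.7+) are assumptions.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Hangfire.Server;
using Microsoft.Extensions.Logging;

public class LongJobService
{
    private readonly ILogger<LongJobService> _logger;

    public LongJobService(ILogger<LongJobService> logger) => _logger = logger;

    // Hangfire fills in the PerformContext and CancellationToken arguments at runtime.
    public async Task ExecuteLongTaskAsync(PerformContext context, CancellationToken token)
    {
        var jobId = context.BackgroundJob.Id;   // the Hangfire job id seen in the logs
        var workerId = Guid.NewGuid();          // per-execution id, as seen in the logs
        var sw = Stopwatch.StartNew();

        _logger.LogInformation("ExecuteLongTaskAsync - Job: {JobId} - Worker: {WorkerId} - Started", jobId, workerId);

        try
        {
            for (var i = 0; i <= 30; i++)
            {
                _logger.LogInformation("ExecuteLongTaskAsync - Job: {JobId} - Worker: {WorkerId} - i = {I} - Execution time: {Elapsed}", jobId, workerId, i, sw.Elapsed);

                // Passing the token means the delay is interrupted as soon as Hangfire
                // cancels the job; as described above, this happened to the first
                // execution once the InvisibilityTimeout was exceeded.
                await Task.Delay(TimeSpan.FromSeconds(10), token);
            }
        }
        catch (OperationCanceledException)
        {
            _logger.LogError("ExecuteLongTaskAsync - Job: {JobId} - Worker: {WorkerId} - Token was cancelled - Execution time: {Elapsed}", jobId, workerId, sw.Elapsed);
        }
    }
}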
This seems to happen because, when the InvisibilityTimeout cancels the first execution, Worker.Execute(BackgroundProcessContext context) removes the job from the queue, despite the fact that the job is still being processed by another worker. It happens in the following lines of Worker.Execute(BackgroundProcessContext context):
var state = PerformJob(context, connection, fetchedJob.JobId);
// >>>> In this example, state is null.

if (state != null)
{
    // Ignore return value, because we should not do anything when the current state is not Processing.
    TryChangeState(context, connection, fetchedJob, state, new[] { ProcessingState.StateName }, CancellationToken.None, context.ShutdownToken);
}

// >>>> I'm not sure the following assumption is correct, because in our case the job was
// >>>> not performed; it is still being executed (by another worker), so I don't think it
// >>>> should be removed from the queue.

// Checkpoint #4. The job was performed, and it is in the one
// of the explicit states (Succeeded, Scheduled and so on).
// It should not be re-queued, but we still need to remove its
// processing information.
requeueOnException = false;
fetchedJob.RemoveFromQueue();
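For reference, and to clarify the distinction the annotations above rely on, the fetched-job contract looks roughly like this. It is an approximation of Hangfire.Storage.IFetchedJob reproduced from memory, so it may differ slightly between versions.

using System;

// Approximation of Hangfire.Storage.IFetchedJob (from memory; may differ between
// versions). RemoveFromQueue() permanently drops the fetched job from the queue,
// while Requeue() makes it visible to other workers again. The annotations above
// argue that RemoveFromQueue() should not run when state is null, i.e. when the job
// was not actually performed by this worker and is still executing elsewhere.
public interface IFetchedJob : IDisposable
{
    string JobId { get; }

    // Called when the job reached a final state and must never be picked up again.
    void RemoveFromQueue();

    // Called when processing did not complete and the job should become visible
    // to workers again.
    void Requeue();
}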
Steps to reproduce
- Download this repro project: https://github.com/braca/hangfire-mongo-bug
- Run the project and go to: http://localhost:5000/swagger
- POST /Test/LongJobtesting (a rough sketch of this endpoint is shown after this list)
- Wait a few minutes
- The job will run for 1 minute on one of the available workers
- After 1 minute, the CancellationToken will be triggered (because of the InvisibilityTimeout)
- A new worker will pick the job and execute it
- This new worker will never receive a cancellation through the CancellationToken, even if the job's execution time exceeds the InvisibilityTimeout
- Restart the server
- The job will be stuck in Processing and it will never be picked by any worker
- You can check the logs/app.logs file to confirm this
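For reference, the repro endpoint used in the steps above is roughly shaped like the following. This is a hypothetical sketch: the real controller lives in the linked repro project, and the route, class, and service names here are assumptions.

using System.Threading;
using Hangfire;
using Microsoft.AspNetCore.Mvc;

// Hypothetical sketch of the repro endpoint; it simply enqueues the long-running
// job sketched earlier.
[ApiController]
[Route("[controller]")]
public class TestController : ControllerBase
{
    private readonly IBackgroundJobClient _jobs;

    public TestController(IBackgroundJobClient jobs) => _jobs = jobs;

    [HttpPost("LongJob")]
    public IActionResult LongJob()
    {
        // The null / CancellationToken.None arguments are placeholders that Hangfire
        // replaces with the real PerformContext and job cancellation token at runtime.
        var jobId = _jobs.Enqueue<LongJobService>(s => s.ExecuteLongTaskAsync(null, CancellationToken.None));
        return Ok(jobId);
    }
}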
Logs
16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Started
16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 0 - Execution time: 00:00:00.0025689
16:50:47.244 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 1 - Execution time: 00:00:10.0158631
16:50:57.253 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 2 - Execution time: 00:00:20.0250140
16:51:07.256 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 3 - Execution time: 00:00:30.0278617
16:51:17.263 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 4 - Execution time: 00:00:40.0343608
16:51:27.262 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 5 - Execution time: 00:00:50.0336615
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - Started
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 0 - Execution time: 00:00:00.0010493
16:51:37.278 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 6 - Execution time: 00:01:00.0498272
16:51:37.320 [ERR] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Token was cancelled - Execution time: 00:01:00.0917212
16:51:47.265 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 1 - Execution time: 00:00:10.0047696
16:51:57.280 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 2 - Execution time: 00:00:20.0198516
16:52:07.281 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 3 - Execution time: 00:00:30.0211184
16:52:17.294 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 4 - Execution time: 00:00:40.0335476
16:52:27.311 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 5 - Execution time: 00:00:50.0509575
16:52:37.321 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 6 - Execution time: 00:01:00.0607559
16:52:47.333 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 7 - Execution time: 00:01:10.0725911
16:52:57.352 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 8 - Execution time: 00:01:20.0913322
16:53:07.363 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 9 - Execution time: 00:01:30.1025756
16:53:17.378 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 10 - Execution time: 00:01:40.1181397
Hi @gottscj,
With the 1.7.0 release I can no longer reproduce our issue; it seems to be fixed, but I'll run some more tests.
Thanks 😃
@braca,
Thanks for the links. I will check this out ASAP.
Thanks!