Job stuck in the Processing state (InvisibilityTimeout not working properly)
During some tests in which our server is restarted while jobs are being executed, we found out that sometimes jobs become stuck in the Processing state and never recover.
After we start the server again, those jobs are never executed, no matter how long we wait (we waited for the invisibility timeout and nothing happened; we even left it for a few hours and still nothing happened).
The Hangfire UI seems to know that the job is stuck/aborted, as it shows the following message:
The job was aborted – it is processed by server testpc:server1:33704:7daebff8-39b2-4899-9f5f-3dae35a39863 which is not in the active servers list for now. It will be retried automatically after invisibility timeout, but you can also re-queue or delete it manually.
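For completeness, the manual re-queue the message refers to can also be done programmatically. This is a small hedged snippet, assuming Hangfire's static BackgroundJob.Requeue behaves like the dashboard's Requeue button; the job id is the one that appears in the logs further below.

using Hangfire;

// Programmatic equivalent of the dashboard's "Requeue" button mentioned in the
// message above (assumed equivalent; verify against your Hangfire version).
BackgroundJob.Requeue("61e992d0172392e01cb7be14");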
After digging into this issue a little more, I found a weird behaviour that I think is causing this, and it is related to InvisibilityTimeout.
In our example we have an InvisibilityTimeout of 2 minutes (it could be 30 minutes, it wouldn't make a difference). We have some long-running jobs that may exceed our InvisibilityTimeout (for our use case, we are OK with that).
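For context, this is roughly how the timeout is configured in our setup. This is a minimal sketch only: it assumes Hangfire.Mongo (as in the repro project linked below), that MongoStorageOptions exposes an InvisibilityTimeout property, and a hypothetical connection string and database name; overloads and option names may differ between versions.

using System;
using Hangfire;
using Hangfire.Mongo;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    services.AddHangfire(config => config.UseMongoStorage(
        "mongodb://localhost:27017",   // hypothetical connection string
        "hangfire-repro",              // hypothetical database name
        new MongoStorageOptions
        {
            // Jobs whose execution exceeds this timeout are considered abandoned
            // and become visible to other workers again.
            InvisibilityTimeout = TimeSpan.FromMinutes(2)
        }));

    services.AddHangfireServer();
}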
When the InvisibilityTimeout is exceeded, the job's CancellationToken is triggered and the worker that is running that job aborts the execution. At the same time, a new worker picks up the job and runs it again. This is the expected behaviour (AFAIK).
The problem is that this only happens once: the worker that is now processing the long-running job will no longer receive any kind of cancellation when the InvisibilityTimeout is exceeded. The same happens if the server is restarted at this point: the job is never processed again and stays stuck in “Processing”.
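To make the scenario concrete, the long-running job in our repro looks roughly like this. It is a hedged sketch reconstructed from the logs further below: the method name ExecuteLongTaskAsync and the log format come from the repro, while the class name, the use of PerformContext, and the plain CancellationToken parameter (supported by Hangfire 1.7+) are assumptions.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Hangfire.Server;
using Microsoft.Extensions.Logging;

public class LongJobService
{
    private readonly ILogger<LongJobService> _logger;

    public LongJobService(ILogger<LongJobService> logger) => _logger = logger;

    // Hangfire fills in the PerformContext and CancellationToken arguments at runtime.
    public async Task ExecuteLongTaskAsync(PerformContext context, CancellationToken token)
    {
        var jobId = context.BackgroundJob.Id;   // the Hangfire job id seen in the logs
        var workerId = Guid.NewGuid();          // per-execution id, as seen in the logs
        var sw = Stopwatch.StartNew();

        _logger.LogInformation("ExecuteLongTaskAsync - Job: {JobId} - Worker: {WorkerId} - Started", jobId, workerId);

        try
        {
            for (var i = 0; i <= 30; i++)
            {
                _logger.LogInformation("ExecuteLongTaskAsync - Job: {JobId} - Worker: {WorkerId} - i = {I} - Execution time: {Elapsed}", jobId, workerId, i, sw.Elapsed);

                // Passing the token means the delay is interrupted as soon as Hangfire
                // cancels the job; as described above, this happened to the first
                // execution once the InvisibilityTimeout was exceeded.
                await Task.Delay(TimeSpan.FromSeconds(10), token);
            }
        }
        catch (OperationCanceledException)
        {
            _logger.LogError("ExecuteLongTaskAsync - Job: {JobId} - Worker: {WorkerId} - Token was cancelled - Execution time: {Elapsed}", jobId, workerId, sw.Elapsed);
        }
    }
}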
This seems to happen because, when the InvisibilityTimeout cancels the first execution, Worker.Execute(BackgroundProcessContext context) removes the job from the queue, despite the fact that the job is still being processed by another worker. It happens in the following lines of Worker.Execute(BackgroundProcessContext context):
var state = PerformJob(context, connection, fetchedJob.JobId);
// >>>> In this example, state is null.

if (state != null)
{
    // Ignore return value, because we should not do anything when the current state is not Processing.
    TryChangeState(context, connection, fetchedJob, state, new[] { ProcessingState.StateName }, CancellationToken.None, context.ShutdownToken);
}

// >>>> I'm not sure the following assumption is correct, because in our case the job was
// >>>> not performed; it is still being executed (by another worker), so I don't think it
// >>>> should be removed from the queue.

// Checkpoint #4. The job was performed, and it is in the one
// of the explicit states (Succeeded, Scheduled and so on).
// It should not be re-queued, but we still need to remove its
// processing information.
requeueOnException = false;
fetchedJob.RemoveFromQueue();
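For reference, and to clarify the distinction the annotations above rely on, the fetched-job contract looks roughly like this. It is an approximation of Hangfire.Storage.IFetchedJob reproduced from memory, so it may differ slightly between versions.

using System;

// Approximation of Hangfire.Storage.IFetchedJob (from memory; may differ between
// versions). RemoveFromQueue() permanently drops the fetched job from the queue,
// while Requeue() makes it visible to other workers again. The annotations above
// argue that RemoveFromQueue() should not run when state is null, i.e. when the job
// was not actually performed by this worker and is still executing elsewhere.
public interface IFetchedJob : IDisposable
{
    string JobId { get; }

    // Called when the job reached a final state and must never be picked up again.
    void RemoveFromQueue();

    // Called when processing did not complete and the job should become visible
    // to workers again.
    void Requeue();
}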
Steps to reproduce
- Download this repro project: https://github.com/braca/hangfire-mongo-bug
- Run the project and go to: http://localhost:5000/swagger
- POST /Test/LongJobtesting (a rough sketch of this endpoint is shown after this list)
- Wait a few minutes
- The job will run for 1 minute on one of the available workers
- After 1 minute, the CancellationToken will be triggered (because of the InvisibilityTimeout)
- A new worker will pick the job and execute it
- This new worker will never receive a cancellation through the CancellationToken, even if the job's execution time exceeds the InvisibilityTimeout
- Restart the server
- The job will be stuck in Processing and it will never be picked by any worker
- You can check the logs/app.logs file to confirm this
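For reference, the repro endpoint used in the steps above is roughly shaped like the following. This is a hypothetical sketch: the real controller lives in the linked repro project, and the route, class, and service names here are assumptions.

using System.Threading;
using Hangfire;
using Microsoft.AspNetCore.Mvc;

// Hypothetical sketch of the repro endpoint; it simply enqueues the long-running
// job sketched earlier.
[ApiController]
[Route("[controller]")]
public class TestController : ControllerBase
{
    private readonly IBackgroundJobClient _jobs;

    public TestController(IBackgroundJobClient jobs) => _jobs = jobs;

    [HttpPost("LongJob")]
    public IActionResult LongJob()
    {
        // The null / CancellationToken.None arguments are placeholders that Hangfire
        // replaces with the real PerformContext and job cancellation token at runtime.
        var jobId = _jobs.Enqueue<LongJobService>(s => s.ExecuteLongTaskAsync(null, CancellationToken.None));
        return Ok(jobId);
    }
}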
Logs
16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Started
16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 0 - Execution time: 00:00:00.0025689
16:50:47.244 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 1 - Execution time: 00:00:10.0158631
16:50:57.253 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 2 - Execution time: 00:00:20.0250140
16:51:07.256 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 3 - Execution time: 00:00:30.0278617
16:51:17.263 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 4 - Execution time: 00:00:40.0343608
16:51:27.262 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 5 - Execution time: 00:00:50.0336615
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - Started
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 0 - Execution time: 00:00:00.0010493
16:51:37.278 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 6 - Execution time: 00:01:00.0498272
16:51:37.320 [ERR] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Token was cancelled - Execution time: 00:01:00.0917212
16:51:47.265 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 1 - Execution time: 00:00:10.0047696
16:51:57.280 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 2 - Execution time: 00:00:20.0198516
16:52:07.281 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 3 - Execution time: 00:00:30.0211184
16:52:17.294 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 4 - Execution time: 00:00:40.0335476
16:52:27.311 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 5 - Execution time: 00:00:50.0509575
16:52:37.321 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 6 - Execution time: 00:01:00.0607559
16:52:47.333 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 7 - Execution time: 00:01:10.0725911
16:52:57.352 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 8 - Execution time: 00:01:20.0913322
16:53:07.363 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 9 - Execution time: 00:01:30.1025756
16:53:17.378 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 - Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 10 - Execution time: 00:01:40.1181397
Hi @gottscj,
With the 1.7.0 release I can no longer reproduce our issue; it seems to be fixed, but I'll run some more tests.
Thanks 😃
@braca,
Thanks for the links. I will check this out ASAP.
Thanks!