
Job stuck in the Processing state (InvisibilityTimeout not working properly)

See original GitHub issue

During some tests where our server is restarted while jobs are being executed, we found that sometimes jobs become stuck in the Processing state and never recover.

After we start the server again, those jobs are never executed, no matter how long we wait (we've waited past the invisibility timeout and nothing happens; we've also left it for a few hours and still nothing happens).

The Hangfire UI seems to know that the job is stuck/aborted, as it shows the following message:

The job was aborted – it is processed by server testpc:server1:33704:7daebff8-39b2-4899-9f5f-3dae35a39863 which is not in the active servers list for now. It will be retried automatically after invisibility timeout, but you can also re-queue or delete it manually.
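
As a stop-gap, the manual re-queue that this message mentions can also be scripted. Below is a minimal sketch (not from the original issue) that uses Hangfire's monitoring API to find Processing jobs whose server is no longer in the active servers list and re-queue them; the helper class name and the paging size are made up for illustration.

using System.Collections.Generic;
using System.Linq;
using Hangfire;

public static class StuckJobRequeuer
{
    public static void RequeueOrphanedProcessingJobs()
    {
        var monitoring = JobStorage.Current.GetMonitoringApi();

        // Servers that are currently reporting heartbeats.
        var activeServers = new HashSet<string>(monitoring.Servers().Select(s => s.Name));

        // Page through jobs in the Processing state (first 500 only, purely for brevity).
        foreach (var item in monitoring.ProcessingJobs(0, 500))
        {
            // item.Key is the job id; item.Value describes the processing job.
            if (item.Value.ServerId != null && !activeServers.Contains(item.Value.ServerId))
            {
                // The owning server is gone, so re-queue the job manually.
                BackgroundJob.Requeue(item.Key);
            }
        }
    }
}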

After digging into this issue a little more, I found a weird behaviour that I think is causing this, and it's related to the InvisibilityTimeout.

In our example we have an InvisibilityTimeout of 2 min (it could be 30 min; it wouldn't make a difference). We have some long-running jobs that may exceed our InvisibilityTimeout (for our use case, we are OK with that).
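
For reference, a timeout like that is typically set when configuring the storage; a minimal sketch, assuming a Hangfire.Mongo version whose MongoStorageOptions exposes an InvisibilityTimeout property (the connection string and database name are placeholders, not taken from the repro project):

using System;
using Hangfire;
using Hangfire.Mongo;

GlobalConfiguration.Configuration.UseMongoStorage(
    "mongodb://localhost:27017",   // placeholder connection string
    "hangfire",                    // placeholder database name
    new MongoStorageOptions
    {
        InvisibilityTimeout = TimeSpan.FromMinutes(2)
    });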

When the InvisibilityTimeout is exceeded, the job's CancellationToken is triggered and the worker that is running the job aborts that execution. At the same time, a new worker picks the job up and runs it again. This is the expected behaviour (AFAIK).
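
For context, the long-running job is shaped roughly like the sketch below; this is a hypothetical reconstruction from the logs further down, not the actual repro code. Hangfire substitutes a CancellationToken into the job method at execution time, and that token is the one triggered when the InvisibilityTimeout is exceeded.

using System;
using System.Threading;
using System.Threading.Tasks;

public class LongJob
{
    // Hangfire injects the CancellationToken when it executes the job.
    public async Task ExecuteLongTaskAsync(CancellationToken cancellationToken)
    {
        var started = DateTime.UtcNow;

        for (var i = 0; i < 30; i++)
        {
            // Aborts this execution once the InvisibilityTimeout triggers the token.
            cancellationToken.ThrowIfCancellationRequested();

            Console.WriteLine($"i = {i} - Execution time: {DateTime.UtcNow - started}");
            await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
        }
    }
}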

The problem is that this only happens once: the worker that is now processing the long-running job will no longer receive any cancellation when the InvisibilityTimeout is exceeded again. Likewise, if the server is restarted at this point, the job will never be processed again and will be stuck in "Processing".

This seems to happen because, when the InvisibilityTimeout cancels the first execution, Worker.Execute(BackgroundProcessContext context) removes the job from the queue, despite the fact that the job is still being processed by another worker.

This happens in the following lines of Worker.Execute(BackgroundProcessContext context):

var state = PerformJob(context, connection, fetchedJob.JobId);
// >>>> In this example, state is null.

if (state != null)
{
    // Ignore return value, because we should not do anything when the current state is not Processing.
    TryChangeState(context, connection, fetchedJob, state, new[] { ProcessingState.StateName }, CancellationToken.None, context.ShutdownToken);
}

// >>>> I'm not sure if the following assumption is correct, because in our case the job was not
// >>>> performed, it is still being executed (in another worker), so I don't think it should be
// >>>> removed from the queue.

// Checkpoint #4. The job was performed, and it is in the one
// of the explicit states (Succeeded, Scheduled and so on).
// It should not be re-queued, but we still need to remove its
// processing information.

requeueOnException = false;
fetchedJob.RemoveFromQueue();
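
Purely as an illustration of the point above (this is not actual Hangfire code, and it assumes TryChangeState reports whether the state change was actually applied): the behaviour described here would not occur if checkpoint #4 only removed the job from the queue once a state had really been produced and applied, e.g.:

var state = PerformJob(context, connection, fetchedJob.JobId);

// Hypothetical guard: only drop the processing information when the job really
// reached an explicit state; otherwise leave it in the queue so the invisibility
// timeout can hand it to another worker again.
if (state != null &&
    TryChangeState(context, connection, fetchedJob, state,
        new[] { ProcessingState.StateName }, CancellationToken.None, context.ShutdownToken))
{
    requeueOnException = false;
    fetchedJob.RemoveFromQueue();
}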

Steps to reproduce

  1. Download this repro project: https://github.com/braca/hangfire-mongo-bug
  2. Run the project and go to: http://localhost:5000/swagger
  3. POST /Test/LongJobtesting (this endpoint enqueues the long-running job; see the sketch after this list)
  4. Wait a few minutes
  5. The job will run for 1 minute in one of the available workers
  6. After 1 minute, the CancellationToken will be triggered (because of the InvisibilityTimeout)
  7. A new worker will pick up the job and execute it
  8. This new worker will never receive any cancellation, even if the job execution time exceeds the InvisibilityTimeout
  9. Restart the server
  10. The job will be stuck in Processing and will never be picked up by any worker
  11. You can check logs/app.logs and confirm this
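
The endpoint in step 3 presumably just enqueues the long-running job; a hypothetical equivalent (the controller and method names are illustrative, not taken from the repro project) looks like this:

using System.Threading;
using Hangfire;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("[controller]")]
public class TestController : ControllerBase
{
    [HttpPost("LongJob")]
    public IActionResult StartLongJob()
    {
        // CancellationToken.None is only a placeholder; Hangfire substitutes its own
        // token (the one tied to the InvisibilityTimeout) when the job actually runs.
        var jobId = BackgroundJob.Enqueue<LongJob>(j => j.ExecuteLongTaskAsync(CancellationToken.None));
        return Ok(jobId);
    }
}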

Logs

16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Started
16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 0 - Execution time: 00:00:00.0025689
16:50:47.244 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 1 - Execution time: 00:00:10.0158631
16:50:57.253 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 2 - Execution time: 00:00:20.0250140
16:51:07.256 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 3 - Execution time: 00:00:30.0278617
16:51:17.263 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 4 - Execution time: 00:00:40.0343608
16:51:27.262 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 5 - Execution time: 00:00:50.0336615
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - Started
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 0 - Execution time: 00:00:00.0010493
16:51:37.278 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 6 - Execution time: 00:01:00.0498272
16:51:37.320 [ERR] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Token was cancelled - Execution time: 00:01:00.0917212
16:51:47.265 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 1 - Execution time: 00:00:10.0047696
16:51:57.280 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 2 - Execution time: 00:00:20.0198516
16:52:07.281 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 3 - Execution time: 00:00:30.0211184
16:52:17.294 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 4 - Execution time: 00:00:40.0335476
16:52:27.311 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 5 - Execution time: 00:00:50.0509575
16:52:37.321 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 6 - Execution time: 00:01:00.0607559
16:52:47.333 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 7 - Execution time: 00:01:10.0725911
16:52:57.352 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 8 - Execution time: 00:01:20.0913322
16:53:07.363 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 9 - Execution time: 00:01:30.1025756
16:53:17.378 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 10 - Execution time: 00:01:40.1181397

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
braca commented, Apr 1, 2022

Hi @gottscj,

With the 1.7.0 release I cannot replicate our issue; it seems to be fixed, but I'll try to do more tests.

Thanks 😃

1 reaction
gottscj commented, Jan 20, 2022

@braca,

Thanks for the links. I will check this out ASAP.

Thanks!

Read more comments on GitHub >

Top Results From Across the Web

Hangfire - Long running jobs repeats after 30mins
I'm using Hangfire.Core 1.7.7 and still having this issue after 30mins of running job. EDIT: Tried with: InvisibilityTimeout = TimeSpan.
Read more >
Handling long running tasks (+ long invisibility timeout) + ...
Hangfire server running in console application ... The invisibility timeout is set to 90 minutes, as some jobs might take this long.
Read more >
Jobs stuck in Initializing status - Server
I have two copies of the same job stuck in the Initializing status in Alteryx Server. All scheduled jobs are now getting queued...
Read more >
Celery ETA Tasks Demystified
Well, that depends in part on the visibility timeout. If it's short, say 30 seconds, it means that other workers will poll unacked...
Read more >
Processing Administration
Timeouts for stuck workers​​ The default value is 180,000 milliseconds (30 minutes). You may want to adjust this value if you have large...
Read more >
