When the next-in-line worker dies, the entire cluster stops processing tasks.

I think I’ve found the cause of https://github.com/Koed00/django-q/issues/218 and https://github.com/Koed00/django-q/issues/200. Django Q generally cycles through which worker gets the next task. If the next worker in that cycle gets killed, the cluster will fail to give out any future tasks, even after the worker has been reincarnated. This also prevents the cluster from shutting down properly (to stop it, I’ve been killing the guard process and then killing the remaining processes).
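The hypothesis above can be sketched as a toy model. This is not django-q's code: the `ToyCluster` class, its method names, and the rule "killing the next-in-line worker stalls everything" are assumptions written only to mirror the behaviour in the logs below, including the way reincarnated workers pick up the next free process number.

```python
from typing import List, Optional


class ToyCluster:
    """Toy model of the behaviour reported in this issue (NOT django-q internals)."""

    def __init__(self, size: int) -> None:
        # Workers are named like the log entries: 1:1, 1:2, ...
        self.workers: List[str] = [f"1:{i}" for i in range(1, size + 1)]
        # In the logs the monitor and pusher take the next two process numbers.
        self._last_id = size + 2
        # Slot of the worker expected to receive the next task.
        self.next_slot = 0
        self.stalled = False

    def dispatch(self, task: str) -> Optional[str]:
        """Return the worker that takes the task, or None once the cluster has stalled."""
        if self.stalled:
            return None
        worker = self.workers[self.next_slot]
        self.next_slot = (self.next_slot + 1) % len(self.workers)
        return worker

    def kill(self, worker: str) -> str:
        """Kill a worker and return the name of its reincarnated replacement."""
        slot = self.workers.index(worker)
        if slot == self.next_slot:
            # Assumed rule, taken from the issue's hypothesis:
            # losing the next-in-line worker stops all further dispatching.
            self.stalled = True
        self._last_id += 1
        self.workers[slot] = f"1:{self._last_id}"
        return self.workers[slot]
```

In this model a 1-worker cluster stalls on any kill, because its only worker is always next in line, while a larger cluster stalls only when the kill happens to hit the worker whose turn is next.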

This is a major problem in 1-worker clusters, where the worker dying always causes the cluster to stop functioning. I suspect 2-worker clusters are much more than twice as reliable: workers are most likely to die while processing a task, and at that point the other worker is usually the one next in line. A good stopgap would be to put a minimum on the default number of workers (which is currently set by multiprocessing.cpu_count()).
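Until the default changes, a similar floor can be applied from project settings. The sketch below assumes a Django settings module using django-q's `Q_CLUSTER` dictionary and its `workers` key; the `MIN_WORKERS` constant, the `default_worker_count` helper, and the cluster name are hypothetical. It only lowers the odds of hitting the stall and does not fix the underlying bug.

```python
import multiprocessing

# Hypothetical floor; this is not a django-q setting.
MIN_WORKERS = 2


def default_worker_count(minimum: int = MIN_WORKERS) -> int:
    """Return the CPU count, but never fewer than `minimum` workers."""
    try:
        cpus = multiprocessing.cpu_count()
    except NotImplementedError:
        # cpu_count() can raise on platforms where the count is unknown.
        cpus = 1
    return max(minimum, cpus)


# Fragment of settings.py; broker and other options are omitted here.
Q_CLUSTER = {
    "name": "example",  # hypothetical cluster name
    "workers": default_worker_count(),
}
```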

Here’s a log of a 2-worker cluster:

Cluster initialization:

04:47:49 [Q] INFO Q Cluster-44650 starting.
04:47:50 [Q] INFO Process-1:1 ready for work at 44656
04:47:50 [Q] INFO Process-1:2 ready for work at 44657
04:47:50 [Q] INFO Process-1 guarding cluster at 44655
04:47:50 [Q] INFO Process-1:3 monitoring at 44658
04:47:50 [Q] INFO Q Cluster-44650 running.
04:47:50 [Q] INFO Process-1:4 pushing tasks at 44659

Alternating workers:

04:47:50 [Q] INFO Process-1:1 processing [earth-edward-jig-oregon]
04:47:50 [Q] INFO Process-1:2 processing [sad-pennsylvania-kansas-carolina]
04:47:50 [Q] INFO Process-1:1 processing [coffee-two-bakerloo-nevada]
04:47:50 [Q] INFO Processed [sad-pennsylvania-kansas-carolina]
04:47:50 [Q] INFO Processed [earth-edward-jig-oregon]
04:47:50 [Q] INFO Processed [coffee-two-bakerloo-nevada]
04:48:03 [Q] INFO Process-1:2 processing [alabama-alanine-nitrogen-colorado]
04:48:03 [Q] INFO Processed [alabama-alanine-nitrogen-colorado]
04:48:07 [Q] INFO Process-1:1 processing [undress-speaker-speaker-friend]
04:48:07 [Q] INFO Processed [undress-speaker-speaker-friend]

1:2 is next, so killing 1:1 doesn’t cause an issue. Both 1:2 and the replacement worker, 1:5, go on to process tasks.

04:48:21 [Q] ERROR reincarnated worker Process-1:1 after death
04:48:21 [Q] INFO Process-1:5 ready for work at 44860
04:48:25 [Q] INFO Process-1:2 processing [nevada-butter-juliet-high]
04:48:25 [Q] INFO Processed [nevada-butter-juliet-high]
04:48:41 [Q] INFO Process-1:5 processing [montana-oscar-asparagus-charlie]
04:48:41 [Q] INFO Processed [montana-oscar-asparagus-charlie]

1:2 is next, so killing 1:5 is safe.

04:49:03 [Q] ERROR reincarnated worker Process-1:5 after death
04:49:03 [Q] INFO Process-1:6 ready for work at 45089
04:49:07 [Q] INFO Process-1:2 processing [johnny-triple-crazy-queen]
04:49:07 [Q] INFO Processed [johnny-triple-crazy-queen]
04:49:09 [Q] INFO Process-1:6 processing [mockingbird-fish-single-comet]
04:49:09 [Q] INFO Processed [mockingbird-fish-single-comet]

Again, 1:2 is next, so killing 1:6 is safe.

04:49:25 [Q] ERROR reincarnated worker Process-1:6 after death
04:49:25 [Q] INFO Process-1:7 ready for work at 45218
04:49:31 [Q] INFO Process-1:2 processing [thirteen-white-princess-burger]
04:49:31 [Q] INFO Processed [thirteen-white-princess-burger]
04:49:33 [Q] INFO Process-1:7 processing [winter-iowa-uniform-yellow]
04:49:33 [Q] INFO Processed [winter-iowa-uniform-yellow]
04:49:35 [Q] INFO Process-1:2 processing [three-idaho-comet-fix]
04:49:35 [Q] INFO Processed [three-idaho-comet-fix]

Now 1:7 is next, so killing 1:7 causes the cluster to stop processing tasks.

04:49:49 [Q] ERROR reincarnated worker Process-1:7 after death
04:49:49 [Q] INFO Process-1:8 ready for work at 45369
(Cluster stopped receiving tasks)

Here’s proof that you can kill all the original workers and still have things working, as long as the worker you kill isn’t the next in line:

05:18:34 [Q] INFO Q Cluster-52176 starting.
05:18:34 [Q] INFO Process-1:1 ready for work at 52183
05:18:34 [Q] INFO Process-1:2 ready for work at 52184
05:18:34 [Q] INFO Process-1 guarding cluster at 52182
05:18:34 [Q] INFO Process-1:3 monitoring at 52185
05:18:34 [Q] INFO Q Cluster-52176 running.
05:18:34 [Q] INFO Process-1:4 pushing tasks at 52186
05:18:34 [Q] INFO Process-1:1 processing [carbon-ten-zebra-fillet]
05:18:34 [Q] INFO Process-1:2 processing [jersey-three-nineteen-nineteen]
05:18:34 [Q] INFO Processed [carbon-ten-zebra-fillet]
05:18:34 [Q] INFO Processed [jersey-three-nineteen-nineteen]
05:18:38 [Q] INFO Process-1:2 processing [july-ceiling-network-comet]
05:18:38 [Q] INFO Processed [july-ceiling-network-comet]
05:18:42 [Q] INFO Process-1:1 processing [september-lemon-social-april]
05:18:42 [Q] INFO Processed [september-lemon-social-april]
05:19:08 [Q] ERROR reincarnated worker Process-1:1 after death
05:19:08 [Q] INFO Process-1:5 ready for work at 52384
05:19:11 [Q] INFO Process-1:2 processing [magnesium-robert-william-ten]
05:19:11 [Q] INFO Processed [magnesium-robert-william-ten]
05:19:12 [Q] INFO Process-1:5 processing [fruit-don-sodium-cup]
05:19:12 [Q] INFO Processed [fruit-don-sodium-cup]
05:19:17 [Q] INFO Process-1:2 processing [seven-bravo-river-washington]
05:19:17 [Q] INFO Processed [seven-bravo-river-washington]
05:19:24 [Q] ERROR reincarnated worker Process-1:2 after death
05:19:24 [Q] INFO Process-1:6 ready for work at 52491
05:19:27 [Q] INFO Process-1:5 processing [uncle-vegan-sodium-finch]
05:19:27 [Q] INFO Processed [uncle-vegan-sodium-finch]
05:20:22 [Q] ERROR reincarnated worker Process-1:6 after death
05:20:22 [Q] INFO Process-1:7 ready for work at 52782
(Cluster stopped receiving tasks)

Here’s a 10-worker cluster that I brought down with the same pattern:

05:23:03 [Q] INFO Q Cluster-53249 starting.
05:23:03 [Q] INFO Process-1:1 ready for work at 53256
05:23:03 [Q] INFO Process-1:2 ready for work at 53257
05:23:03 [Q] INFO Process-1:3 ready for work at 53258
05:23:03 [Q] INFO Process-1:4 ready for work at 53259
05:23:03 [Q] INFO Process-1:5 ready for work at 53260
05:23:03 [Q] INFO Process-1:6 ready for work at 53261
05:23:03 [Q] INFO Process-1:7 ready for work at 53262
05:23:03 [Q] INFO Process-1:8 ready for work at 53263
05:23:03 [Q] INFO Process-1:9 ready for work at 53264
05:23:04 [Q] INFO Process-1 guarding cluster at 53255
05:23:04 [Q] INFO Process-1:10 ready for work at 53265
05:23:04 [Q] INFO Process-1:11 monitoring at 53266
05:23:04 [Q] INFO Q Cluster-53249 running.
05:23:04 [Q] INFO Process-1:12 pushing tasks at 53267
05:23:04 [Q] INFO Process-1:1 processing [florida-wolfram-leopard-lamp]
05:23:04 [Q] INFO Processed [florida-wolfram-leopard-lamp]
05:23:14 [Q] INFO Process-1:2 processing [dakota-undress-maryland-mars]
05:23:15 [Q] INFO Processed [dakota-undress-maryland-mars]
05:23:21 [Q] INFO Process-1:3 processing [oklahoma-snake-one-tango]
05:23:21 [Q] INFO Processed [oklahoma-snake-one-tango]
05:23:22 [Q] INFO Process-1:4 processing [summer-lamp-eighteen-foxtrot]
05:23:22 [Q] INFO Processed [summer-lamp-eighteen-foxtrot]
05:23:23 [Q] INFO Process-1:5 processing [pizza-north-delaware-bulldog]
05:23:23 [Q] INFO Processed [pizza-north-delaware-bulldog]
05:23:23 [Q] INFO Process-1:6 processing [louisiana-shade-aspen-eleven]
05:23:23 [Q] INFO Processed [louisiana-shade-aspen-eleven]
05:23:24 [Q] INFO Process-1:7 processing [cardinal-echo-blue-river]
05:23:24 [Q] INFO Processed [cardinal-echo-blue-river]
05:23:25 [Q] INFO Process-1:8 processing [happy-mirror-cup-zebra]
05:23:25 [Q] INFO Processed [happy-mirror-cup-zebra]
05:23:26 [Q] INFO Process-1:9 processing [butter-mississippi-snake-princess]
05:23:26 [Q] INFO Processed [butter-mississippi-snake-princess]
05:23:27 [Q] INFO Process-1:10 processing [georgia-indigo-queen-glucose]
05:23:27 [Q] INFO Processed [georgia-indigo-queen-glucose]
05:23:28 [Q] INFO Process-1:1 processing [vegan-hydrogen-louisiana-twelve]
05:23:28 [Q] INFO Processed [vegan-hydrogen-louisiana-twelve]
05:23:29 [Q] INFO Process-1:2 processing [may-wyoming-papa-india]
05:23:29 [Q] INFO Processed [may-wyoming-papa-india]
05:23:30 [Q] INFO Process-1:3 processing [network-six-bulldog-yellow]
05:23:30 [Q] INFO Processed [network-six-bulldog-yellow]
05:23:39 [Q] ERROR reincarnated worker Process-1:4 after death
05:23:39 [Q] INFO Process-1:13 ready for work at 53543
(Cluster stopped receiving tasks)

When tasks are received at the same time, the pattern is less clear. Both of these examples broke the cluster:

05:06:44 [Q] INFO Q Cluster-50031 starting.
05:06:44 [Q] INFO Process-1:1 ready for work at 50037
05:06:44 [Q] INFO Process-1 guarding cluster at 50036
05:06:44 [Q] INFO Process-1:2 ready for work at 50038
05:06:44 [Q] INFO Process-1:3 monitoring at 50039
05:06:44 [Q] INFO Q Cluster-50031 running.
05:06:44 [Q] INFO Process-1:4 pushing tasks at 50040
05:06:44 [Q] INFO Process-1:1 processing [red-floor-lactose-stairway]
05:06:44 [Q] INFO Process-1:2 processing [friend-pennsylvania-robin-sad]
05:06:44 [Q] INFO Process-1:1 processing [georgia-black-football-single]
05:06:44 [Q] INFO Process-1:2 processing [sodium-cat-oven-xray]
05:06:44 [Q] INFO Processed [friend-pennsylvania-robin-sad]
05:06:44 [Q] INFO Processed [red-floor-lactose-stairway]
05:06:44 [Q] INFO Processed [sodium-cat-oven-xray]
05:06:44 [Q] INFO Processed [georgia-black-football-single]
05:07:09 [Q] ERROR reincarnated worker Process-1:2 after death
05:07:09 [Q] INFO Process-1:5 ready for work at 50186
(Cluster stopped receiving tasks)

05:14:54 [Q] INFO Q Cluster-51780 starting.
05:14:54 [Q] INFO Process-1:1 ready for work at 51786
05:14:54 [Q] INFO Process-1:2 ready for work at 51787
05:14:54 [Q] INFO Process-1 guarding cluster at 51785
05:14:54 [Q] INFO Process-1:3 monitoring at 51788
05:14:54 [Q] INFO Q Cluster-51780 running.
05:14:54 [Q] INFO Process-1:4 pushing tasks at 51789
05:14:54 [Q] INFO Process-1:1 processing [avocado-wyoming-yellow-fix]
05:14:54 [Q] INFO Process-1:2 processing [illinois-charlie-foxtrot-nitrogen]
05:14:54 [Q] INFO Processed [avocado-wyoming-yellow-fix]
05:14:54 [Q] INFO Processed [illinois-charlie-foxtrot-nitrogen]
05:15:07 [Q] ERROR reincarnated worker Process-1:2 after death
05:15:07 [Q] INFO Process-1:5 ready for work at 51872
(Cluster stopped receiving tasks)

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 6

Top GitHub Comments

5 reactions
Moon-developer commented, Apr 8, 2021

I seem to be facing the same issue, is there any news on improvements?

Did you manage to come up with a fix for this? I see this issue is rare and goes back to 2016 😅 I’m currently facing the same thing and am not looking forward to spending much time working out what’s causing it.

1 reaction
willaaam commented, Mar 8, 2021

I seem to be facing the same issue, is there any news on improvements?
