[Bug] Worker stuck in "closing-gracefully" state


I have a cluster of 8 workers distributed across k8s pods in AWS. I noticed that sometimes one of the workers gets stuck in the “closing-gracefully” state because of an assertion error raised while running the worker’s close_gracefully() method.

For some reason the scheduler still tries to send tasks to this worker, which just fail after some time because the worker does not actually execute them.

Manually closing the worker using the client’s retire_workers() method works, and I’m currently using it as a workaround.
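The workaround could be automated along these lines. This is a minimal sketch, not part of the original report: the helper names find_stuck_workers and retire_stuck_workers are hypothetical, and it assumes worker statuses come back as plain strings, as in the status output shown in this issue (other versions of distributed may report the status differently). It relies only on the two client calls that appear in this report, client.run(...) and client.retire_workers(...).

```python
def find_stuck_workers(statuses, stuck_state="closing-gracefully"):
    """Return the sorted addresses whose reported status equals stuck_state."""
    return sorted(addr for addr, status in statuses.items() if status == stuck_state)


def retire_stuck_workers(client):
    """Retire every worker that reports the stuck state; return their addresses.

    `client` is expected to be a connected distributed.Client.
    """
    # Ask each worker for its own status, as done in this report.
    statuses = client.run(lambda dask_worker: dask_worker.status)
    stuck = find_stuck_workers(statuses)
    if stuck:
        # Same call as the manual workaround, applied to all stuck workers.
        client.retire_workers(stuck)
    return stuck
```

Calling retire_stuck_workers(client) periodically would retire any worker found in the stuck state; it does nothing when all workers report another status.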

After digging around the code base a bit, I found that the code responsible for this behaviour is in the scheduler’s replicate() method. The failing assertion, which I did not completely understand, is not handled, so the worker never finishes closing.

From analyzing the expression I could conclude the following:

  • n_missing is greater than 0, otherwise the method would have returned
  • branching_factor’s default value is used, which is 2

From those two points, it seems that len(ts.who_has) is 0.
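To make the deduction concrete, here is a small sketch. It assumes the value checked by the assertion has the form min(n_missing, branching_factor * len(ts.who_has)); that form is inferred from the names used above and is not quoted from the scheduler source, so treat it as an assumption. The function name replica_count is hypothetical.

```python
def replica_count(n_missing, n_holders, branching_factor=2):
    """Assumed form of the value checked by `assert count > 0`."""
    return min(n_missing, branching_factor * n_holders)


# With n_missing > 0 and branching_factor == 2, the result is positive
# whenever at least one worker holds the data ...
assert replica_count(n_missing=1, n_holders=1) == 1
assert replica_count(n_missing=5, n_holders=1) == 2
# ... and it is 0 only when no worker holds it, i.e. len(ts.who_has) == 0.
assert replica_count(n_missing=1, n_holders=0) == 0
```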

Unfortunately, I have not yet found a minimal example that reproduces this issue.

Stack Trace
2019-09-02 at 01:40:09 | LOCALDEV | DASK     | INFO     | distributed.worker:close_gracefully:1116 - Closing worker gracefully: tcp://10.165.119.248:37069
2019-09-02 at 01:40:10 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:10 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:11 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | INFO     | agena_data_acquisition.tools.flow.utilities:set_container_components_on_hash_cache:152 - Setting categories and timestamps on hash cache
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:13 | LOCALDEV | DASK     | INFO     | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:15 | LOCALDEV | DASK     | ERROR    | tornado.ioloop:_run_callback:763 - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>>, <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1104> exception=AssertionError()>)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
    │          │  │         │        └ {<Signals.SIGCHLD: 17>: <Handlers.SIG_DFL: 0>, <Signals.SIGINT: 2>: <built-in function default_int_handler>}
    │          │  │         └ 17
    │          │  └ <socket.socket [closed] fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
    │          └ <socket.socket [closed] fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
    └ <function _serve_one at 0x7f625c10c730>
  File "/usr/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    code = spawn._main(child_r)
           │     │     └ 11
           │     └ <function _main at 0x7f625c111d08>
           └ <module 'multiprocessing.spawn' from '/usr/lib/python3.6/multiprocessing/spawn.py'>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 118, in _main
    return self._bootstrap()
           │    └ <function BaseProcess._bootstrap at 0x7f625d3ddd08>
           └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f625d3dd510>
    └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
    │    │        │    └ (<bound method WorkerProcess._run of <class 'distributed.nanny.WorkerProcess'>>, (), {'worker_kwargs': {'scheduler_ip': 'tcp:...
    │    │        └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
    │    └ <bound method AsyncProcess._run of <class 'distributed.process.AsyncProcess'>>
    └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
  File "/opt/venv/lib/python3.6/site-packages/distributed/process.py", line 179, in _run
    target(*args, **kwargs)
    │       │       └ {'worker_kwargs': {'scheduler_ip': 'tcp://dask-scheduler:8786', 'nthreads': 4, 'local_directory': '/tmp', 'services': None, '...
    │       └ ()
    └ <bound method WorkerProcess._run of <class 'distributed.nanny.WorkerProcess'>>
  File "/opt/venv/lib/python3.6/site-packages/distributed/nanny.py", line 697, in _run
    loop.run_sync(run)
    │    │        └ <function WorkerProcess._run.<locals>.run at 0x7f623d58c8c8>
    │    └ <function IOLoop.run_sync at 0x7f625a0e0378>
    └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
  File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 526, in run_sync
    self.start()
    │    └ <function BaseAsyncIOLoop.start at 0x7f625993b6a8>
    └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
  File "/opt/venv/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 148, in start
    self.asyncio_loop.run_forever()
    │    │            └ <function BaseEventLoop.run_forever at 0x7f625acee158>
    │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
    └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
  File "/usr/lib/python3.6/asyncio/base_events.py", line 438, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7f625acf9620>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/usr/lib/python3.6/asyncio/base_events.py", line 1451, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7f625acdb2f0>
    └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
    │    │          │    └ <member '_args' of 'Handle' objects>
    │    │          └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
    │    └ <member '_callback' of 'Handle' objects>
    └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
  File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 690, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
           │  │    │             │         │       │         └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
           │  │    │             │         │       └ <bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>>
           │  │    │             │         └ <class 'functools.partial'>
           │  │    │             └ <module 'functools' from '/usr/lib/python3.6/functools.py'>
           │  │    └ <function IOLoop._run_callback at 0x7f625a0e0b70>
           │  └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
           └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
> File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
          └ functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db4...
  File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
    │      └ <method 'result' of '_asyncio.Task' objects>
    └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
  File "/opt/venv/lib/python3.6/site-packages/distributed/worker.py", line 1118, in close_gracefully
    await self.scheduler.retire_workers(workers=[self.address], remove=False)
          │    │                                 │    └ <property object at 0x7f62590b5098>
          │    │                                 └ <Worker: tcp://10.165.119.248:37069, closing-gracefully, stored: 37, running: 2/4, ready: 0, comm: 0, waiting: 0>
          │    └ <pooled rpc to 'tcp://dask-scheduler:8786'>
          └ <Worker: tcp://10.165.119.248:37069, closing-gracefully, stored: 37, running: 2/4, ready: 0, comm: 0, waiting: 0>
  File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 750, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
                   │         │    │        │      └ {'workers': ['tcp://10.165.119.248:37069'], 'remove': False}
                   │         │    │        └ 'retire_workers'
                   │         │    └ <TCP ConnectionPool local=tcp://10.165.119.248:58730 remote=tcp://dask-scheduler:8786>
                   │         └ <TCP ConnectionPool local=tcp://10.165.119.248:58730 remote=tcp://dask-scheduler:8786>
                   └ <function send_recv at 0x7f625912dc80>
  File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 559, in send_recv
    six.reraise(*clean_exception(**response))
    │   │        │                 └ {'status': 'uncaught-error', 'text': '', 'exception': AssertionError(), 'traceback': <traceback object at 0x7f60f8ba7e48>}
    │   │        └ <function clean_exception at 0x7f62590b6e18>
    │   └ <function reraise at 0x7f625a0b17b8>
    └ <module 'six' from '/opt/venv/lib/python3.6/site-packages/six.py'>
  File "/opt/venv/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
          │                    └ None
          └ None
  File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 416, in handle_comm
    result = await result
  File "/opt/venv/lib/python3.6/site-packages/distributed/scheduler.py", line 3117, in retire_workers
    delete=False,
  File "/opt/venv/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/opt/venv/lib/python3.6/site-packages/distributed/scheduler.py", line 2881, in replicate
    assert count > 0

AssertionError: assert count > 0
2019-09-02 at 01:55:24 | LOCALDEV | DASK     | INFO     | distributed.worker:run:3377 - Run out-of-band function 'collect'
2019-09-02 at 01:55:25 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 02:11:15 | LOCALDEV | DASK     | INFO     | distributed.worker:run:3377 - Run out-of-band function 'collect'
2019-09-02 at 02:11:15 | LOCALDEV | DASK     | WARNING  | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 02:27:08 | LOCALDEV | DASK     | INFO     | distributed.worker:run:3377 - Run out-of-band function 'collect'
Workers status
>>> pprint(client.run(lambda dask_worker: dask_worker.status))
{'tcp://10.165.118.150:37697': 'running',
 'tcp://10.165.118.190:40097': 'running',
 'tcp://10.165.119.194:43069': 'closing-gracefully',
 'tcp://10.165.134.5:37665': 'running',
 'tcp://10.165.134.98:37897': 'running',
 'tcp://10.165.135.151:41777': 'running',
 'tcp://10.165.135.24:39843': 'running',
 'tcp://10.165.135.75:41797': 'running'}
Manually Closing the Worker
>>> client.retire_workers(['tcp://10.165.119.194:43069'])
{'tcp://10.165.119.194:43069': {'type': 'Worker', 'id': 'tcp://10.165.119.194:43069', 'host': '10.165.119.194', 'resources': {}, 'local_directory': '/tmp/worker-insrgwz0', 'name': 'tcp://10.165.119.194:43069', 'nthreads': 4, 'memory_limit': 6000000000, 'last_seen': 1567423741.9343932, 'services': {'dashboard': 40601}, 'metrics': {'cpu': 4.0, 'memory': 1920286720, 'time': 1567423741.4320846, 'read_bytes': 285.9653648363741, 'write_bytes': 773.9062670746627, 'num_fds': 1284, 'executing': 0, 'in_memory': 0, 'ready': 0, 'in_flight': 0, 'bandwidth': 100000000}, 'nanny': 'tcp://10.165.119.194:32957'}}

>>> pprint(client.run(lambda dask_worker: dask_worker.status))
{'tcp://10.165.118.150:41061': 'running',
 'tcp://10.165.118.190:40097': 'running',
 'tcp://10.165.119.194:39373': 'running',
 'tcp://10.165.134.5:37665': 'running',
 'tcp://10.165.134.98:37897': 'running',
 'tcp://10.165.135.151:41777': 'running',
 'tcp://10.165.135.24:39843': 'running',
 'tcp://10.165.135.75:41797': 'running'}

Scheduler Info
>>> pprint(client.scheduler_info())
{'address': 'tcp://10.165.119.247:8786',
 'id': 'Scheduler-ae134b55-02b0-4644-9053-1a4d27cd6253',
 'services': {'dashboard': 8787},
 'type': 'Scheduler',
 'workers': {'tcp://10.165.118.150:40363': {'host': '10.165.118.150',
                                            'id': 'tcp://10.165.118.150:40363',
                                            'last_seen': 1567418324.9379842,
                                            'local_directory': '/tmp/worker-yn1xir95',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 0.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 106196992,
                                                        'num_fds': 27,
                                                        'read_bytes': 285.2919233032259,
                                                        'ready': 0,
                                                        'time': 1567418324.9362829,
                                                        'write_bytes': 768.0936396625311},
                                            'name': 'tcp://10.165.118.150:40363',
                                            'nanny': 'tcp://10.165.118.150:41743',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 44259},
                                            'type': 'Worker'},
             'tcp://10.165.118.190:39499': {'host': '10.165.118.190',
                                            'id': 'tcp://10.165.118.190:39499',
                                            'last_seen': 1567418325.1259258,
                                            'local_directory': '/tmp/worker-n67kzmw4',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 0.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 107712512,
                                                        'num_fds': 27,
                                                        'read_bytes': 285.8943480874215,
                                                        'ready': 0,
                                                        'time': 1567418325.1242785,
                                                        'write_bytes': 769.7155525430579},
                                            'name': 'tcp://10.165.118.190:39499',
                                            'nanny': 'tcp://10.165.118.190:37897',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 36281},
                                            'type': 'Worker'},
             'tcp://10.165.119.194:43069': {'host': '10.165.119.194',
                                            'id': 'tcp://10.165.119.194:43069',
                                            'last_seen': 1567418324.9348905,
                                            'local_directory': '/tmp/worker-insrgwz0',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 2.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 1865625600,
                                                        'num_fds': 1284,
                                                        'read_bytes': 285.88140257823795,
                                                        'ready': 0,
                                                        'time': 1567418324.9321342,
                                                        'write_bytes': 773.6790405439028},
                                            'name': 'tcp://10.165.119.194:43069',
                                            'nanny': 'tcp://10.165.119.194:32957',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 40601},
                                            'type': 'Worker'},
             'tcp://10.165.134.5:45259': {'host': '10.165.134.5',
                                          'id': 'tcp://10.165.134.5:45259',
                                          'last_seen': 1567418325.0068722,
                                          'local_directory': '/tmp/worker-c9_tdwxs',
                                          'memory_limit': 6000000000,
                                          'metrics': {'bandwidth': 100000000,
                                                      'cpu': 0.0,
                                                      'executing': 0,
                                                      'in_flight': 0,
                                                      'in_memory': 0,
                                                      'memory': 114638848,
                                                      'num_fds': 27,
                                                      'read_bytes': 285.86423455352053,
                                                      'ready': 0,
                                                      'time': 1567418325.004854,
                                                      'write_bytes': 765.636376461527},
                                          'name': 'tcp://10.165.134.5:45259',
                                          'nanny': 'tcp://10.165.134.5:42299',
                                          'nthreads': 4,
                                          'resources': {},
                                          'services': {'dashboard': 42483},
                                          'type': 'Worker'},
             'tcp://10.165.134.98:42651': {'host': '10.165.134.98',
                                           'id': 'tcp://10.165.134.98:42651',
                                           'last_seen': 1567418325.138247,
                                           'local_directory': '/tmp/worker-gmprr26a',
                                           'memory_limit': 6000000000,
                                           'metrics': {'bandwidth': 100000000,
                                                       'cpu': 2.0,
                                                       'executing': 0,
                                                       'in_flight': 0,
                                                       'in_memory': 0,
                                                       'memory': 114290688,
                                                       'num_fds': 27,
                                                       'read_bytes': 286.39261759547566,
                                                       'ready': 0,
                                                       'time': 1567418324.6365747,
                                                       'write_bytes': 769.0543017948438},
                                           'name': 'tcp://10.165.134.98:42651',
                                           'nanny': 'tcp://10.165.134.98:45353',
                                           'nthreads': 4,
                                           'resources': {},
                                           'services': {'dashboard': 37307},
                                           'type': 'Worker'},
             'tcp://10.165.135.151:46743': {'host': '10.165.135.151',
                                            'id': 'tcp://10.165.135.151:46743',
                                            'last_seen': 1567418325.299708,
                                            'local_directory': '/tmp/worker-ytdfgks_',
                                            'memory_limit': 6000000000,
                                            'metrics': {'bandwidth': 100000000,
                                                        'cpu': 4.0,
                                                        'executing': 0,
                                                        'in_flight': 0,
                                                        'in_memory': 0,
                                                        'memory': 114499584,
                                                        'num_fds': 27,
                                                        'read_bytes': 286.0085919100238,
                                                        'ready': 0,
                                                        'time': 1567418325.29762,
                                                        'write_bytes': 770.0231320654489},
                                            'name': 'tcp://10.165.135.151:46743',
                                            'nanny': 'tcp://10.165.135.151:35201',
                                            'nthreads': 4,
                                            'resources': {},
                                            'services': {'dashboard': 34289},
                                            'type': 'Worker'},
             'tcp://10.165.135.24:39503': {'host': '10.165.135.24',
                                           'id': 'tcp://10.165.135.24:39503',
                                           'last_seen': 1567418325.1116345,
                                           'local_directory': '/tmp/worker-96w5bp0w',
                                           'memory_limit': 6000000000,
                                           'metrics': {'bandwidth': 100000000,
                                                       'cpu': 2.0,
                                                       'executing': 0,
                                                       'in_flight': 0,
                                                       'in_memory': 0,
                                                       'memory': 113909760,
                                                       'num_fds': 27,
                                                       'read_bytes': 285.5692941296901,
                                                       'ready': 0,
                                                       'time': 1567418325.1096437,
                                                       'write_bytes': 766.8434192014055},
                                           'name': 'tcp://10.165.135.24:39503',
                                           'nanny': 'tcp://10.165.135.24:37487',
                                           'nthreads': 4,
                                           'resources': {},
                                           'services': {'dashboard': 44285},
                                           'type': 'Worker'},
             'tcp://10.165.135.75:42995': {'host': '10.165.135.75',
                                           'id': 'tcp://10.165.135.75:42995',
                                           'last_seen': 1567418325.064751,
                                           'local_directory': '/tmp/worker-_zkylo2z',
                                           'memory_limit': 6000000000,
                                           'metrics': {'bandwidth': 100000000,
                                                       'cpu': 0.0,
                                                       'executing': 0,
                                                       'in_flight': 0,
                                                       'in_memory': 0,
                                                       'memory': 112508928,
                                                       'num_fds': 27,
                                                       'read_bytes': 285.7694387958324,
                                                       'ready': 0,
                                                       'time': 1567418325.062967,
                                                       'write_bytes': 767.3808706125849},
                                           'name': 'tcp://10.165.135.75:42995',
                                           'nanny': 'tcp://10.165.135.75:34145',
                                           'nthreads': 4,
                                           'resources': {},
                                           'services': {'dashboard': 35467},
                                           'type': 'Worker'}}}
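
A dump this long is hard to compare by eye. The sketch below, which is not part of the original report, condenses the 'workers' mapping into one row per worker. The helper name summarize_workers is hypothetical, and it assumes each entry carries the 'metrics' keys 'memory' and 'num_fds', as in the output above.

```python
def summarize_workers(info):
    """Return (address, memory_bytes, num_fds) rows, highest num_fds first."""
    rows = [
        (addr, worker["metrics"]["memory"], worker["metrics"]["num_fds"])
        for addr, worker in info["workers"].items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

With a connected client, summarize_workers(client.scheduler_info()) would list tcp://10.165.119.194:43069 first for the output above, since its num_fds (1284) is far higher than the 27 reported by the other workers.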

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

2 reactions
chinmaychandak commented, Jul 1, 2020

> I think that what people may not understand is that it’s no one’s job to fix these issues. The people who do so are often doing so for free as volunteers, or because they need to fix them to solve some problem that they’re having at work.

> Unfortunately people sometimes treat these community github issue trackers as a place where they go to ask people to do free work for them. They look a lot like other github issue trackers that they use in their workplace to ask other teams in their company to do work for them, which is reasonable given that those teams are paid.

> Instead, I encourage you to think of these issue trackers as a place to collaborate on work. @AnesBenmerzoug was kind enough to make a reproducer (at significant personal cost it sounds like) great, who can take up the torch and work from there? Alternatively, if there are people paid by your company to fix these problems then maybe you can point them here and they can do this work.

> People like me volunteer our time to help shepherd this process along, but we’re not here to fix everyone’s problem for free. There are too many problems to fix unfortunately, and we tend to be pretty busy fixing the problems that people pay us to fix.

I definitely agree with everything here, and I again sincerely apologize for the inappropriate phrasing. Did not ever intend to ask people to fix my problem, or to stress you or any of the other maintainers. I think it’s brilliant enough that so many important features are getting merged into Dask, and open-source projects in general! 😃

> If the reproducer provided by @AnesBenmerzoug matches your situation then great. I just ran it but after a few minutes I’m not quite sure what I’m looking at, so I’m probably going to move on. Maybe you can help investigate here?

Yes, I am going to try to investigate this soon. Will post findings here.

1 reaction
fjetter commented, Dec 14, 2021

We’re currently working on making graceful downscaling much more robust, which should avoid the above replicate assertion error. See https://github.com/dask/distributed/pull/5381 for the current WIP. We’re struggling with a few flaky tests but are hoping to merge soon.
