[Bug] Worker stuck in "closing-gracefully" state
I have a cluster composed of 8 workers distributed across k8s pods in AWS. I noticed that sometimes one of the workers gets stuck in the "closing-gracefully" state because of an assertion error in the worker's close_gracefully() method.
For some reason the scheduler still tries to send tasks to this worker; they eventually fail because the worker never actually executes them.
Manually closing the worker using the client's retire_workers() method works, and I'm currently using it as a workaround.
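For reference, this is a minimal sketch of the workaround I run from the client when I notice a stuck worker (it just combines the status check and retire_workers() call shown further down; the scheduler address is the one from my deployment, and the exact status string may differ between distributed versions):

from distributed import Client

client = Client("tcp://dask-scheduler:8786")  # scheduler address in my cluster

# Ask each worker for its own status and retire the ones that report
# "closing-gracefully" but never actually finish closing.
statuses = client.run(lambda dask_worker: dask_worker.status)
stuck = [addr for addr, status in statuses.items() if status == "closing-gracefully"]
if stuck:
    client.retire_workers(stuck)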
After digging around the code base a bit, I found that the part of the code responsible for this behaviour is in the scheduler's replicate() method. The failing assertion, which I did not completely understand, is not handled, so the worker never finishes closing.
From analyzing the failing expression I could conclude the following:
- n_missing is greater than 0, otherwise the method would have returned
- branching_factor's default value is used, which is 2
From those two points it seems that len(ts.who_has) is 0 (see the sketch below).
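To make the deduction concrete, here is a tiny self-contained reconstruction of how I understand the count computation in replicate() (my own paraphrase from reading the source, not the actual scheduler code):

def replicate_count(n_missing, who_has_len, branching_factor=2):
    # My reading of how replicate() derives the number of extra copies to gather.
    return min(n_missing, branching_factor * who_has_len)

# With n_missing > 0 and the default branching_factor of 2, count can only be 0
# when the task has no replicas at all, i.e. len(ts.who_has) == 0:
assert replicate_count(n_missing=3, who_has_len=1) > 0   # healthy case
assert replicate_count(n_missing=3, who_has_len=0) == 0  # hits the failing `assert count > 0`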
Unfortunately I have not yet found a minimal example to reproduce this issue.
Stack Trace
2019-09-02 at 01:40:09 | LOCALDEV | DASK | INFO | distributed.worker:close_gracefully:1116 - Closing worker gracefully: tcp://10.165.119.248:37069
2019-09-02 at 01:40:10 | LOCALDEV | DASK | INFO | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:10 | LOCALDEV | DASK | INFO | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:11 | LOCALDEV | DASK | WARNING | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 01:40:13 | LOCALDEV | DASK | WARNING | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 01:40:13 | LOCALDEV | DASK | INFO | agena_data_acquisition.tools.flow.utilities:set_container_components_on_hash_cache:152 - Setting categories and timestamps on hash cache
2019-09-02 at 01:40:13 | LOCALDEV | DASK | INFO | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:13 | LOCALDEV | DASK | INFO | distributed.worker:transition_executing_done:1646 - Comm closed
2019-09-02 at 01:40:15 | LOCALDEV | DASK | ERROR | tornado.ioloop:_run_callback:763 - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>>, <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1104> exception=AssertionError()>)
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
_serve_one(s, listener, alive_r, old_handlers)
│ │ │ │ └ {<Signals.SIGCHLD: 17>: <Handlers.SIG_DFL: 0>, <Signals.SIGINT: 2>: <built-in function default_int_handler>}
│ │ │ └ 17
│ │ └ <socket.socket [closed] fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
│ └ <socket.socket [closed] fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
└ <function _serve_one at 0x7f625c10c730>
File "/usr/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
code = spawn._main(child_r)
│ │ └ 11
│ └ <function _main at 0x7f625c111d08>
└ <module 'multiprocessing.spawn' from '/usr/lib/python3.6/multiprocessing/spawn.py'>
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 118, in _main
return self._bootstrap()
│ └ <function BaseProcess._bootstrap at 0x7f625d3ddd08>
└ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
│ └ <function BaseProcess.run at 0x7f625d3dd510>
└ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
│ │ │ └ (<bound method WorkerProcess._run of <class 'distributed.nanny.WorkerProcess'>>, (), {'worker_kwargs': {'scheduler_ip': 'tcp:...
│ │ └ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
│ └ <bound method AsyncProcess._run of <class 'distributed.process.AsyncProcess'>>
└ <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>
File "/opt/venv/lib/python3.6/site-packages/distributed/process.py", line 179, in _run
target(*args, **kwargs)
│ │ └ {'worker_kwargs': {'scheduler_ip': 'tcp://dask-scheduler:8786', 'nthreads': 4, 'local_directory': '/tmp', 'services': None, '...
│ └ ()
└ <bound method WorkerProcess._run of <class 'distributed.nanny.WorkerProcess'>>
File "/opt/venv/lib/python3.6/site-packages/distributed/nanny.py", line 697, in _run
loop.run_sync(run)
│ │ └ <function WorkerProcess._run.<locals>.run at 0x7f623d58c8c8>
│ └ <function IOLoop.run_sync at 0x7f625a0e0378>
└ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 526, in run_sync
self.start()
│ └ <function BaseAsyncIOLoop.start at 0x7f625993b6a8>
└ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
File "/opt/venv/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 148, in start
self.asyncio_loop.run_forever()
│ │ └ <function BaseEventLoop.run_forever at 0x7f625acee158>
│ └ <_UnixSelectorEventLoop running=True closed=False debug=False>
└ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
File "/usr/lib/python3.6/asyncio/base_events.py", line 438, in run_forever
self._run_once()
│ └ <function BaseEventLoop._run_once at 0x7f625acf9620>
└ <_UnixSelectorEventLoop running=True closed=False debug=False>
File "/usr/lib/python3.6/asyncio/base_events.py", line 1451, in _run_once
handle._run()
│ └ <function Handle._run at 0x7f625acdb2f0>
└ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
self._callback(*self._args)
│ │ │ └ <member '_args' of 'Handle' objects>
│ │ └ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
│ └ <member '_callback' of 'Handle' objects>
└ <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...ertionError()>) at /opt/venv/lib/python3.6/site-packages/tornado/...
File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 690, in <lambda>
lambda f: self._run_callback(functools.partial(callback, future))
│ │ │ │ │ │ └ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
│ │ │ │ │ └ <bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>>
│ │ │ │ └ <class 'functools.partial'>
│ │ │ └ <module 'functools' from '/usr/lib/python3.6/functools.py'>
│ │ └ <function IOLoop._run_callback at 0x7f625a0e0b70>
│ └ <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db400>
└ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
> File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
└ functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f625d4db4...
File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
│ └ <method 'result' of '_asyncio.Task' objects>
└ <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1...
File "/opt/venv/lib/python3.6/site-packages/distributed/worker.py", line 1118, in close_gracefully
await self.scheduler.retire_workers(workers=[self.address], remove=False)
│ │ │ └ <property object at 0x7f62590b5098>
│ │ └ <Worker: tcp://10.165.119.248:37069, closing-gracefully, stored: 37, running: 2/4, ready: 0, comm: 0, waiting: 0>
│ └ <pooled rpc to 'tcp://dask-scheduler:8786'>
└ <Worker: tcp://10.165.119.248:37069, closing-gracefully, stored: 37, running: 2/4, ready: 0, comm: 0, waiting: 0>
File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 750, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
│ │ │ │ └ {'workers': ['tcp://10.165.119.248:37069'], 'remove': False}
│ │ │ └ 'retire_workers'
│ │ └ <TCP ConnectionPool local=tcp://10.165.119.248:58730 remote=tcp://dask-scheduler:8786>
│ └ <TCP ConnectionPool local=tcp://10.165.119.248:58730 remote=tcp://dask-scheduler:8786>
└ <function send_recv at 0x7f625912dc80>
File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 559, in send_recv
six.reraise(*clean_exception(**response))
│ │ │ └ {'status': 'uncaught-error', 'text': '', 'exception': AssertionError(), 'traceback': <traceback object at 0x7f60f8ba7e48>}
│ │ └ <function clean_exception at 0x7f62590b6e18>
│ └ <function reraise at 0x7f625a0b17b8>
└ <module 'six' from '/opt/venv/lib/python3.6/site-packages/six.py'>
File "/opt/venv/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
│ └ None
└ None
File "/opt/venv/lib/python3.6/site-packages/distributed/core.py", line 416, in handle_comm
result = await result
File "/opt/venv/lib/python3.6/site-packages/distributed/scheduler.py", line 3117, in retire_workers
delete=False,
File "/opt/venv/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
yielded = self.gen.send(value)
File "/opt/venv/lib/python3.6/site-packages/distributed/scheduler.py", line 2881, in replicate
assert count > 0
AssertionError: assert count > 0
2019-09-02 at 01:55:24 | LOCALDEV | DASK | INFO | distributed.worker:run:3377 - Run out-of-band function 'collect'
2019-09-02 at 01:55:25 | LOCALDEV | DASK | WARNING | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 02:11:15 | LOCALDEV | DASK | INFO | distributed.worker:run:3377 - Run out-of-band function 'collect'
2019-09-02 at 02:11:15 | LOCALDEV | DASK | WARNING | distributed.utils_perf:_gc_callback:204 - full garbage collections took 30% CPU time recently (threshold: 10%)
2019-09-02 at 02:27:08 | LOCALDEV | DASK | INFO | distributed.worker:run:3377 - Run out-of-band function 'collect'
Workers status
>>> pprint(client.run(lambda dask_worker: dask_worker.status))
{'tcp://10.165.118.150:37697': 'running',
'tcp://10.165.118.190:40097': 'running',
'tcp://10.165.119.194:43069': 'closing-gracefully',
'tcp://10.165.134.5:37665': 'running',
'tcp://10.165.134.98:37897': 'running',
'tcp://10.165.135.151:41777': 'running',
'tcp://10.165.135.24:39843': 'running',
'tcp://10.165.135.75:41797': 'running'}
Manually Closing the Worker
>>> client.retire_workers(['tcp://10.165.119.194:43069'])
{'tcp://10.165.119.194:43069': {'type': 'Worker', 'id': 'tcp://10.165.119.194:43069', 'host': '10.165.119.194', 'resources': {}, 'local_directory': '/tmp/worker-insrgwz0', 'name': 'tcp://10.165.119.194:43069', 'nthreads': 4, 'memory_limit': 6000000000, 'last_seen': 1567423741.9343932, 'services': {'dashboard': 40601}, 'metrics': {'cpu': 4.0, 'memory': 1920286720, 'time': 1567423741.4320846, 'read_bytes': 285.9653648363741, 'write_bytes': 773.9062670746627, 'num_fds': 1284, 'executing': 0, 'in_memory': 0, 'ready': 0, 'in_flight': 0, 'bandwidth': 100000000}, 'nanny': 'tcp://10.165.119.194:32957'}}
>>> pprint(client.run(lambda dask_worker: dask_worker.status))
{'tcp://10.165.118.150:41061': 'running',
'tcp://10.165.118.190:40097': 'running',
'tcp://10.165.119.194:39373': 'running',
'tcp://10.165.134.5:37665': 'running',
'tcp://10.165.134.98:37897': 'running',
'tcp://10.165.135.151:41777': 'running',
'tcp://10.165.135.24:39843': 'running',
'tcp://10.165.135.75:41797': 'running'}
Scheduler Info
>>> pprint(client.scheduler_info())
{'address': 'tcp://10.165.119.247:8786',
'id': 'Scheduler-ae134b55-02b0-4644-9053-1a4d27cd6253',
'services': {'dashboard': 8787},
'type': 'Scheduler',
'workers': {'tcp://10.165.118.150:40363': {'host': '10.165.118.150',
'id': 'tcp://10.165.118.150:40363',
'last_seen': 1567418324.9379842,
'local_directory': '/tmp/worker-yn1xir95',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 0.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 106196992,
'num_fds': 27,
'read_bytes': 285.2919233032259,
'ready': 0,
'time': 1567418324.9362829,
'write_bytes': 768.0936396625311},
'name': 'tcp://10.165.118.150:40363',
'nanny': 'tcp://10.165.118.150:41743',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 44259},
'type': 'Worker'},
'tcp://10.165.118.190:39499': {'host': '10.165.118.190',
'id': 'tcp://10.165.118.190:39499',
'last_seen': 1567418325.1259258,
'local_directory': '/tmp/worker-n67kzmw4',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 0.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 107712512,
'num_fds': 27,
'read_bytes': 285.8943480874215,
'ready': 0,
'time': 1567418325.1242785,
'write_bytes': 769.7155525430579},
'name': 'tcp://10.165.118.190:39499',
'nanny': 'tcp://10.165.118.190:37897',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 36281},
'type': 'Worker'},
'tcp://10.165.119.194:43069': {'host': '10.165.119.194',
'id': 'tcp://10.165.119.194:43069',
'last_seen': 1567418324.9348905,
'local_directory': '/tmp/worker-insrgwz0',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 2.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 1865625600,
'num_fds': 1284,
'read_bytes': 285.88140257823795,
'ready': 0,
'time': 1567418324.9321342,
'write_bytes': 773.6790405439028},
'name': 'tcp://10.165.119.194:43069',
'nanny': 'tcp://10.165.119.194:32957',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 40601},
'type': 'Worker'},
'tcp://10.165.134.5:45259': {'host': '10.165.134.5',
'id': 'tcp://10.165.134.5:45259',
'last_seen': 1567418325.0068722,
'local_directory': '/tmp/worker-c9_tdwxs',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 0.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 114638848,
'num_fds': 27,
'read_bytes': 285.86423455352053,
'ready': 0,
'time': 1567418325.004854,
'write_bytes': 765.636376461527},
'name': 'tcp://10.165.134.5:45259',
'nanny': 'tcp://10.165.134.5:42299',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 42483},
'type': 'Worker'},
'tcp://10.165.134.98:42651': {'host': '10.165.134.98',
'id': 'tcp://10.165.134.98:42651',
'last_seen': 1567418325.138247,
'local_directory': '/tmp/worker-gmprr26a',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 2.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 114290688,
'num_fds': 27,
'read_bytes': 286.39261759547566,
'ready': 0,
'time': 1567418324.6365747,
'write_bytes': 769.0543017948438},
'name': 'tcp://10.165.134.98:42651',
'nanny': 'tcp://10.165.134.98:45353',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 37307},
'type': 'Worker'},
'tcp://10.165.135.151:46743': {'host': '10.165.135.151',
'id': 'tcp://10.165.135.151:46743',
'last_seen': 1567418325.299708,
'local_directory': '/tmp/worker-ytdfgks_',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 4.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 114499584,
'num_fds': 27,
'read_bytes': 286.0085919100238,
'ready': 0,
'time': 1567418325.29762,
'write_bytes': 770.0231320654489},
'name': 'tcp://10.165.135.151:46743',
'nanny': 'tcp://10.165.135.151:35201',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 34289},
'type': 'Worker'},
'tcp://10.165.135.24:39503': {'host': '10.165.135.24',
'id': 'tcp://10.165.135.24:39503',
'last_seen': 1567418325.1116345,
'local_directory': '/tmp/worker-96w5bp0w',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 2.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 113909760,
'num_fds': 27,
'read_bytes': 285.5692941296901,
'ready': 0,
'time': 1567418325.1096437,
'write_bytes': 766.8434192014055},
'name': 'tcp://10.165.135.24:39503',
'nanny': 'tcp://10.165.135.24:37487',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 44285},
'type': 'Worker'},
'tcp://10.165.135.75:42995': {'host': '10.165.135.75',
'id': 'tcp://10.165.135.75:42995',
'last_seen': 1567418325.064751,
'local_directory': '/tmp/worker-_zkylo2z',
'memory_limit': 6000000000,
'metrics': {'bandwidth': 100000000,
'cpu': 0.0,
'executing': 0,
'in_flight': 0,
'in_memory': 0,
'memory': 112508928,
'num_fds': 27,
'read_bytes': 285.7694387958324,
'ready': 0,
'time': 1567418325.062967,
'write_bytes': 767.3808706125849},
'name': 'tcp://10.165.135.75:42995',
'nanny': 'tcp://10.165.135.75:34145',
'nthreads': 4,
'resources': {},
'services': {'dashboard': 35467},
'type': 'Worker'}}}
Top GitHub Comments
I definitely agree with everything here, and I again sincerely apologize for the inappropriate phrasing. I never intended to ask people to fix my problem, or to stress you or any of the other maintainers. I think it's brilliant that so many important features are getting merged into Dask, and into open-source projects in general!
Yes, I am going to try to investigate this soon. Will post findings here.
We're currently working on making graceful downscaling much more robust, which should avoid the replicate assertion error above. See https://github.com/dask/distributed/pull/5381 for the current WIP. We're struggling with a few flaky tests but are hoping to merge soon.