Restart worker on CommClosedError
On a dask-gateway GKE cluster, I have a few workers that stop processing; their logs show some combination of CommClosedError, missing-dependency warnings, and garbage-collection messages. The scheduler seems happy with the worker, as the “Last seen” remains up to date.
I’ve attached an example log. After 30 minutes this worker was still “processing” its current task, which I believe should take under a second. I killed the worker, the graph backed up to redo the lost work, and the computation eventually completed.
I have written dask-cluster-manager jobs which restart schedulers that are leaking memory. I see there is a client.get_scheduler_logs(), which could be parsed to detect this. Is there some way to detect this condition and restart the worker (besides me searching the GKE dashboard and doing it manually)? Is there a setting that can mitigate this? The lifetime.{duration,stagger,restart} options seem like a last resort.
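Roughly the kind of watchdog I have in mind, as an untested sketch (the helper names and scheduler address are placeholders, not part of distributed's API): it polls worker logs for CommClosedError and forces each affected worker process to exit so its nanny respawns it.

```python
# Untested sketch of a log-polling watchdog; helper names and the scheduler
# address below are placeholders, not an official Dask feature.
import os

from distributed import Client


def _exit_worker():
    # Runs inside the worker process; the nanny sees the process die
    # and respawns a fresh worker.
    os._exit(1)


def restart_stuck_workers(client, marker="CommClosedError"):
    """Force-restart every worker whose recent logs contain `marker`."""
    restarted = []
    # Client.get_worker_logs() returns {worker address: [(level, message), ...]}
    for addr, records in client.get_worker_logs().items():
        if any(marker in message for _, message in records):
            try:
                # The worker dies before it can reply, so a comm error from
                # this call is expected and harmless.
                client.run(_exit_worker, workers=[addr])
            except Exception:
                pass
            restarted.append(addr)
    return restarted


if __name__ == "__main__":
    client = Client("tls://scheduler-address:8786")  # placeholder address
    print("restarted:", restart_stuck_workers(client))
```

The coarser alternative would be the worker lifetime settings mentioned above (distributed.worker.lifetime.duration / .stagger / .restart, or the equivalent dask-worker --lifetime flags), which recycle workers on a schedule rather than on detected failures.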
Logs
```
2021-06-05T22:10:08.218039578Z + '[' '' ']'
2021-06-05T22:10:08.218953428Z + '[' -e /opt/app/environment.yml ']'
2021-06-05T22:10:08.219179401Z no environment.yml
2021-06-05T22:10:08.219232124Z + echo 'no environment.yml'
2021-06-05T22:10:08.219250341Z + '[' '' ']'
2021-06-05T22:10:08.219255783Z + '[' '' ']'
2021-06-05T22:10:08.219260917Z + exec dask-worker tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786 --dashboard-address :8787 --name dask-worker-e85d926daa1646c2a782e2cf4ea00f25-kp6lp --nthreads 1 --memory-limit 7516192768
2021-06-05T22:10:09.406322052Z distributed.nanny - INFO - Start Nanny at: 'tls://10.4.91.2:44355'
2021-06-05T22:10:11.295639846Z distributed.worker - INFO - Start worker at: tls://10.4.91.2:46259
2021-06-05T22:10:11.296092345Z distributed.worker - INFO - Listening to: tls://10.4.91.2:46259
2021-06-05T22:10:11.296168743Z distributed.worker - INFO - dashboard at: 10.4.91.2:8787
2021-06-05T22:10:11.296178429Z distributed.worker - INFO - Waiting to connect to: tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786
2021-06-05T22:10:11.296184092Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.296189147Z distributed.worker - INFO - Threads: 1
2021-06-05T22:10:11.296222187Z distributed.worker - INFO - Memory: 7.00 GiB
2021-06-05T22:10:11.296260851Z distributed.worker - INFO - Local Directory: /workdir/dask-worker-space/worker-jrhd2vn0
2021-06-05T22:10:11.296284668Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.344733302Z distributed.worker - INFO - Starting Worker plugin <distributed.client._WorkerSetupPlugin object at 0-bfd8f69b-5b98-4ebd-a664-bfd8b2f9bc5d
2021-06-05T22:10:11.346146079Z distributed.worker - INFO - Registered to: tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786
2021-06-05T22:10:11.346253619Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.347449156Z distributed.core - INFO - Starting established connection
2021-06-05T22:10:40.508654171Z distributed.comm.tcp - INFO - Connection closed before handshake completed
2021-06-05T22:13:31.952504492Z distributed.utils_perf - INFO - full garbage collection released 160.61 MiB from 472 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:13:38.009084289Z distributed.comm.tcp - INFO - Connection closed before handshake completed
2021-06-05T22:13:43.654490665Z distributed.comm.tcp - INFO - Connection closed before handshake completed
2021-06-05T22:14:45.191432862Z distributed.worker - ERROR - Worker stream died during communication: tls://10.4.86.3:45971
2021-06-05T22:14:45.191488484Z Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2189, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3475, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3455, in _get_data
    response = await send_recv(
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 645, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TLS>: Stream is closed
2021-06-05T22:14:45.191741414Z distributed.worker - INFO - Can't find dependencies for key ('rechunk-merge-transpose-654fb163d718a0167f4124e19d50f10c', 5, 2, 3, 0)
2021-06-05T22:14:45.202056511Z distributed.worker - INFO - Can't find dependencies for key ('rechunk-split-1661c99969731fa78455440409ca42b2', 5746)
2021-06-05T22:14:46.034760748Z distributed.worker - INFO - Dependent not found: ('getitem-421b304eb1d88a03fa4a18a7ad28ada7', 1, 0, 12, 15) 0 . Asking scheduler
2021-06-05T22:15:14.977635691Z distributed.utils_perf - INFO - full garbage collection released 60.92 MiB from 667 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:16.577678892Z distributed.worker - ERROR - Worker stream died during communication: tls://10.4.86.2:44013
2021-06-05T22:15:16.577729765Z OSError: [Errno 113] No route to host

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 285, in connect
    comm = await asyncio.wait_for(
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 391, in connect
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TLSConnector object at 0x7f2acadb4a60>: OSError: [Errno 113] No route to host

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2189, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3475, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3452, in _get_data
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1010, in connect
    comm = await connect(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 309, in connect
    raise IOError(
OSError: Timed out trying to connect to tls://10.4.86.2:44013 after 10 s
2021-06-05T22:15:24.435351939Z distributed.utils_perf - INFO - full garbage collection released 229.12 MiB from 278 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:41.568841350Z distributed.utils_perf - INFO - full garbage collection released 66.56 MiB from 373 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:52.363204788Z distributed.utils_perf - INFO - full garbage collection released 143.41 MiB from 114 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:16:00.234860246Z distributed.comm.tcp - WARNING - Listener on 'tls://10.4.91.2:46259': TLS handshake failed with remote 'tls://10.4.95.4:41278': EOF occurred in violation of protocol (_ssl.c:1131)
```
Top GitHub Comments
Just wanted to say that things seem to have gotten better! I upgraded to the latest version, and haven’t seen this issue yet! Fingers crossed.
Thank you!
We’ve recently merged an important PR addressing a few error-handling edge cases which caused unrecoverable deadlocks. These deadlocks were associated with failing workers, connection failures, or host-co-located workers. All of these issues could be connected to fetching dependencies, so dense, highly connected task graphs were more likely to be affected. Ultimately, the deadlocks were caused by subtle race conditions, which made them hard to reproduce, and some of them cannot be correlated to any user-facing logs, which is why I cannot say for certain whether your issue would be fixed. I would encourage you to try out the latest changes on main and/or wait for the upcoming release later today. Feedback on whether your issue is resolved is highly appreciated!
Deadlock fix: https://github.com/dask/distributed/pull/4784
Upcoming release: https://github.com/dask/community/issues/165