Restart worker on CommClosedError
On a dask-gateway GKE cluster, I have a few workers that stop processing; their logs show some combination of CommClosedError, missing-dependency warnings, and garbage-collection messages. The scheduler seems happy with the worker, as the “Last seen” remains up to date.
I’ve attached an example log. After 30 minutes this worker was still “processing” its current task, which I believe should take under a second. I killed the worker, the graph backed up to redo the lost work, and the computation eventually completed.
I have written dask-cluster-manager jobs which restart schedulers that are leaking memory. I see there is a client.get_scheduler_logs(), which could be parsed to detect this. Is there some way to detect this condition and restart the worker (besides me searching the GKE dashboard and doing it manually)? Is there a setting that can mitigate this? The lifetime.{duration,stagger,restart} options seem like a last resort.
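Roughly the kind of watchdog I have in mind, as an untested sketch (the helper names and scheduler address are placeholders, not part of distributed's API): it polls worker logs for CommClosedError and forces each affected worker process to exit so its nanny respawns it.

```python
# Untested sketch of a log-polling watchdog; helper names and the scheduler
# address below are placeholders, not an official Dask feature.
import os

from distributed import Client


def _exit_worker():
    # Runs inside the worker process; the nanny sees the process die
    # and respawns a fresh worker.
    os._exit(1)


def restart_stuck_workers(client, marker="CommClosedError"):
    """Force-restart every worker whose recent logs contain `marker`."""
    restarted = []
    # Client.get_worker_logs() returns {worker address: [(level, message), ...]}
    for addr, records in client.get_worker_logs().items():
        if any(marker in message for _, message in records):
            try:
                # The worker dies before it can reply, so a comm error from
                # this call is expected and harmless.
                client.run(_exit_worker, workers=[addr])
            except Exception:
                pass
            restarted.append(addr)
    return restarted


if __name__ == "__main__":
    client = Client("tls://scheduler-address:8786")  # placeholder address
    print("restarted:", restart_stuck_workers(client))
```

The coarser alternative would be the worker lifetime settings mentioned above (distributed.worker.lifetime.duration / .stagger / .restart, or the equivalent dask-worker --lifetime flags), which recycle workers on a schedule rather than on detected failures.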
Logs
```
2021-06-05T22:10:08.218039578Z + '[' '' ']'
2021-06-05T22:10:08.218953428Z + '[' -e /opt/app/environment.yml ']'
2021-06-05T22:10:08.219179401Z no environment.yml
2021-06-05T22:10:08.219232124Z + echo 'no environment.yml'
2021-06-05T22:10:08.219250341Z + '[' '' ']'
2021-06-05T22:10:08.219255783Z + '[' '' ']'
2021-06-05T22:10:08.219260917Z + exec dask-worker tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786 --dashboard-address :8787 --name dask-worker-e85d926daa1646c2a782e2cf4ea00f25-kp6lp --nthreads 1 --memory-limit 7516192768
2021-06-05T22:10:09.406322052Z distributed.nanny - INFO - Start Nanny at: 'tls://10.4.91.2:44355'
2021-06-05T22:10:11.295639846Z distributed.worker - INFO - Start worker at: tls://10.4.91.2:46259
2021-06-05T22:10:11.296092345Z distributed.worker - INFO - Listening to: tls://10.4.91.2:46259
2021-06-05T22:10:11.296168743Z distributed.worker - INFO - dashboard at: 10.4.91.2:8787
2021-06-05T22:10:11.296178429Z distributed.worker - INFO - Waiting to connect to: tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786
2021-06-05T22:10:11.296184092Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.296189147Z distributed.worker - INFO - Threads: 1
2021-06-05T22:10:11.296222187Z distributed.worker - INFO - Memory: 7.00 GiB
2021-06-05T22:10:11.296260851Z distributed.worker - INFO - Local Directory: /workdir/dask-worker-space/worker-jrhd2vn0
2021-06-05T22:10:11.296284668Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.344733302Z distributed.worker - INFO - Starting Worker plugin <distributed.client._WorkerSetupPlugin object at 0-bfd8f69b-5b98-4ebd-a664-bfd8b2f9bc5d
2021-06-05T22:10:11.346146079Z distributed.worker - INFO - Registered to: tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786
2021-06-05T22:10:11.346253619Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.347449156Z distributed.core - INFO - Starting established connection
2021-06-05T22:10:40.508654171Z distributed.comm.tcp - INFO - Connection closed before handshake completed
2021-06-05T22:13:31.952504492Z distributed.utils_perf - INFO - full garbage collection released 160.61 MiB from 472 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:13:38.009084289Z distributed.comm.tcp - INFO - Connection closed before handshake completed
2021-06-05T22:13:43.654490665Z distributed.comm.tcp - INFO - Connection closed before handshake completed
2021-06-05T22:14:45.191432862Z distributed.worker - ERROR - Worker stream died during communication: tls://10.4.86.3:45971
2021-06-05T22:14:45.191488484Z Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 200, in read
    n = await stream.read_into(frames)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2189, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3475, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3455, in _get_data
    response = await send_recv(
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 645, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 206, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TLS>: Stream is closed
2021-06-05T22:14:45.191741414Z distributed.worker - INFO - Can't find dependencies for key ('rechunk-merge-transpose-654fb163d718a0167f4124e19d50f10c', 5, 2, 3, 0)
2021-06-05T22:14:45.202056511Z distributed.worker - INFO - Can't find dependencies for key ('rechunk-split-1661c99969731fa78455440409ca42b2', 5746)
2021-06-05T22:14:46.034760748Z distributed.worker - INFO - Dependent not found: ('getitem-421b304eb1d88a03fa4a18a7ad28ada7', 1, 0, 12, 15) 0 . Asking scheduler
2021-06-05T22:15:14.977635691Z distributed.utils_perf - INFO - full garbage collection released 60.92 MiB from 667 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:16.577678892Z distributed.worker - ERROR - Worker stream died during communication: tls://10.4.86.2:44013
2021-06-05T22:15:16.577729765Z OSError: [Errno 113] No route to host

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 285, in connect
    comm = await asyncio.wait_for(
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 391, in connect
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TLSConnector object at 0x7f2acadb4a60>: OSError: [Errno 113] No route to host

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2189, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3475, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3452, in _get_data
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1010, in connect
    comm = await connect(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 309, in connect
    raise IOError(
OSError: Timed out trying to connect to tls://10.4.86.2:44013 after 10 s
2021-06-05T22:15:24.435351939Z distributed.utils_perf - INFO - full garbage collection released 229.12 MiB from 278 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:41.568841350Z distributed.utils_perf - INFO - full garbage collection released 66.56 MiB from 373 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:52.363204788Z distributed.utils_perf - INFO - full garbage collection released 143.41 MiB from 114 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:16:00.234860246Z distributed.comm.tcp - WARNING - Listener on 'tls://10.4.91.2:46259': TLS handshake failed with remote 'tls://10.4.95.4:41278': EOF occurred in violation of protocol (_ssl.c:1131)
```
Top GitHub Comments
Just wanted to say that things seem to have gotten better! I upgraded to the latest version, and haven’t seen this issue yet! Fingers crossed.
Thank you!
We’ve recently merged an important PR addressing a few error-handling edge cases which caused unrecoverable deadlocks. These deadlocks were associated with failing workers, connection failures, or host-co-located workers. All of these issues could be connected to fetching dependencies, so dense, highly connected task graphs were more likely to be affected. Ultimately, the deadlocks were caused by subtle race conditions, which made them hard to reproduce, and some of them cannot be correlated to any user-facing logs, which is why I cannot say for certain whether your issue would be fixed. I would encourage you to try out the latest changes on main and/or wait for the upcoming release later today. Feedback on whether your issue is resolved is highly appreciated!
Deadlock fix: https://github.com/dask/distributed/pull/4784
Upcoming release: https://github.com/dask/community/issues/165