
`CommClosedErrors` resulting from `Client.shutdown()`

See original GitHub issue

Describe the issue:

I am experiencing CommClosedError errors during the shutdown of a manually constructed Dask cluster using the Dask CLI. This was first noted in dask/dask-mpi#94.

Minimal Complete Verifiable Example:

You need a minimum of 3 open terminal sessions (e.g., bash) to reproduce the error.

In Terminal 1:

$ dask-scheduler

Note the ADDRESS in the Scheduler at: ADDRESS:8786 log message.

In Terminal 2:

$ dask-worker ADDRESS:8786

where ADDRESS is the scheduler IP address (without the port number).

In Terminal 3:

$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from distributed import Client
>>> client = Client('ADDRESS:8786')
>>> client.shutdown()

where ADDRESS is the scheduler IP address noted above.
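
For convenience, the three terminal steps can be approximated in a single Python script. This is only a sketch; the localhost address, the default scheduler port 8786, the startup delay, and the use of subprocess are assumptions beyond the original three-terminal recipe:

# Sketch: drive the same three-terminal reproduction from one script.
import subprocess
import time

from distributed import Client

scheduler = subprocess.Popen(["dask-scheduler"])              # Terminal 1
worker = subprocess.Popen(["dask-worker", "localhost:8786"])  # Terminal 2
time.sleep(5)  # give the scheduler and worker time to start and register

client = Client("localhost:8786")                             # Terminal 3
client.shutdown()  # worker log should now show the CommClosedError traceback

# The client heartbeat keeps raising CommClosedError until this process exits.
scheduler.wait()
worker.wait()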

Results

In Terminal 1, the scheduler shuts down appropriately without errors.

In Terminal 2, the worker shuts down, but not without error. The logs of the worker after client.shutdown() is called are:

2022-10-25 15:54:14,502 - distributed.worker - INFO - Stopping worker at ADDRESS:40141
2022-10-25 15:54:14,503 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-ce265b58-eecf-4335-bd1e-d0fc3b07e93d Address ADDRESS:40141 Status: Status.closing
2022-10-25 15:54:14,503 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=ADDRESS:33682 remote=ADDRESS:8786>
Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File ".../lib/python3.10/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File ".../lib/python3.10/site-packages/distributed/comm/tcp.py", line 269, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2022-10-25 15:54:14,507 - distributed.nanny - INFO - Worker closed
2022-10-25 15:54:14,508 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-25 15:54:14,699 - distributed.nanny - INFO - Closing Nanny at 'ADDRESS:46611'.
2022-10-25 15:54:14,699 - distributed.dask_worker - INFO - End worker

In Terminal 3, where the client is running, a CommClosedError appears every time the client heartbeat is called (about once every 5 seconds):

2022-10-25 15:54:15,825 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'ADDRESS:8786' processes=1 threads=8, memory=7.63 GiB>>
Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File ".../lib/python3.10/site-packages/distributed/client.py", line 1390, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
  File ".../lib/python3.10/site-packages/distributed/batched.py", line 156, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Client->Scheduler local=ADDRESS:33802 remote=ADDRESS:8786> already closed.

And this does not stop repeating until the Python process exits (e.g., via exit()).

Anything else we need to know?:

Interestingly, the error in the worker logs (Terminal 2) can be avoided by calling client.retire_workers() before client.shutdown(), but the CommClosedError errors in the client application (Terminal 3) are still present.
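
A minimal sketch of that workaround, reusing the ADDRESS placeholder from the reproduction above:

from distributed import Client

client = Client("ADDRESS:8786")  # ADDRESS is the scheduler IP noted earlier

# Retiring the workers first lets them disconnect cleanly, which avoids the
# CommClosedError traceback in the worker logs (Terminal 2)...
client.retire_workers()
client.shutdown()

# ...but the heartbeat CommClosedError in the client process (Terminal 3)
# still repeats until the interpreter exits.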

NOTE: The CommClosedError messages in the worker logs appear to have been introduced in version 2022.4.2. These errors do not appear in version 2022.4.1 or 2022.4.0.

Environment:

  • Dask version: 2022.10.0 (worker-log errors reproduce back to 2022.4.2; the client heartbeat CommClosedErrors appear much further back)
  • Python version: 3.10.X, 3.9.X, 3.8.X
  • Operating System: Linux, Windows
  • Install method (conda, pip, source): conda (from conda-forge)

Top GitHub Comments

lgarrison commented, Oct 27, 2022

Here’s the log with dask_mpi.initialize(exit=False):

Log
(venv8) lgarrison@scclin021:~/scc/daskdistrib$ srun -n3 -p scc python ./repro_commclosed.py 
srun: job 1923374 queued and waiting for resources
srun: job 1923374 has been allocated resources
2022-10-27 19:19:48,692 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-10-27 19:19:48,767 - distributed.scheduler - INFO - State start
2022-10-27 19:19:48,774 - distributed.scheduler - INFO -   Scheduler at: tcp://10.128.145.133:42799
2022-10-27 19:19:48,775 - distributed.scheduler - INFO -   dashboard at:                     :8787
2022-10-27 19:19:48,824 - distributed.worker - INFO -       Start worker at: tcp://10.128.145.133:35207
2022-10-27 19:19:48,824 - distributed.worker - INFO -          Listening to: tcp://10.128.145.133:35207
2022-10-27 19:19:48,824 - distributed.worker - INFO -           Worker name:                          2
2022-10-27 19:19:48,824 - distributed.worker - INFO -          dashboard at:       10.128.145.133:43001
2022-10-27 19:19:48,824 - distributed.worker - INFO - Waiting to connect to: tcp://10.128.145.133:42799
2022-10-27 19:19:48,824 - distributed.worker - INFO - -------------------------------------------------
2022-10-27 19:19:48,824 - distributed.worker - INFO -               Threads:                          1
2022-10-27 19:19:48,824 - distributed.worker - INFO -                Memory:                  17.86 GiB
2022-10-27 19:19:48,824 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-uc23mab3
2022-10-27 19:19:48,824 - distributed.worker - INFO - -------------------------------------------------
2022-10-27 19:19:49,772 - distributed.scheduler - INFO - Receive client connection: Client-d9f8b61a-564d-11ed-9035-7cd30ad7b998
2022-10-27 19:19:49,774 - distributed.core - INFO - Starting established connection
2022-10-27 19:19:49,797 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://10.128.145.133:35207', name: 2, status: init, memory: 0, processing: 0>
2022-10-27 19:19:49,798 - distributed.scheduler - INFO - Starting worker compute stream, tcp://10.128.145.133:35207
2022-10-27 19:19:49,798 - distributed.core - INFO - Starting established connection
2022-10-27 19:19:49,798 - distributed.worker - INFO -         Registered to: tcp://10.128.145.133:42799
2022-10-27 19:19:49,798 - distributed.worker - INFO - -------------------------------------------------
2022-10-27 19:19:49,799 - distributed.core - INFO - Starting established connection
2022-10-27 19:19:50,780 - distributed.scheduler - INFO - Retiring worker tcp://10.128.145.133:35207
2022-10-27 19:19:50,781 - distributed.active_memory_manager - INFO - Retiring worker tcp://10.128.145.133:35207; no unique keys need to be moved away.
2022-10-27 19:19:50,781 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.128.145.133:35207', name: 2, status: closing_gracefully, memory: 0, processing: 0>
2022-10-27 19:19:50,781 - distributed.core - INFO - Removing comms to tcp://10.128.145.133:35207
2022-10-27 19:19:50,781 - distributed.scheduler - INFO - Lost all workers
2022-10-27 19:19:50,781 - distributed.scheduler - INFO - Retired worker tcp://10.128.145.133:35207
2022-10-27 19:19:50,786 - distributed.worker - INFO - Stopping worker at tcp://10.128.145.133:35207
2022-10-27 19:19:50,787 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-efb29a00-13f9-4eb3-ba7b-d8dac55100ed Address tcp://10.128.145.133:35207 Status: Status.closing
2022-10-27 19:19:50,793 - distributed.scheduler - INFO - Receive client connection: Client-db2b8532-564d-11ed-9036-7cd30ad7b998
2022-10-27 19:19:50,794 - distributed.core - INFO - Starting established connection
2022-10-27 19:19:51,783 - distributed.scheduler - INFO - Scheduler closing...
2022-10-27 19:19:51,784 - distributed.scheduler - INFO - Scheduler closing all comms
2022-10-27 19:19:54,909 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-27 19:19:54,910 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-27 19:19:54,910 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-27 19:19:54,910 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-27 19:19:54,911 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-27 19:19:54,911 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-10-27 19:19:54,911 - distributed.nanny - ERROR - Worker process died unexpectedly

Traceback (most recent call last):
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/client.py", line 1246, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/client.py", line 1276, in _ensure_connected
    comm = await connect(
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/comm/core.py", line 315, in connect
    await asyncio.sleep(backoff)
  File "/mnt/sw/nix/store/db63z7j5w4n84c625pv5b57m699bnbws-python-3.8.12-view/lib/python3.8/asyncio/tasks.py", line 659, in sleep
    return await future
asyncio.exceptions.CancelledError

Traceback (most recent call last):
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/client.py", line 1451, in _handle_report
    await self._reconnect()
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/client.py", line 1246, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/client.py", line 1276, in _ensure_connected
    comm = await connect(
  File "/mnt/home/lgarrison/scc/daskdistrib/venv8/lib/python3.8/site-packages/distributed/comm/core.py", line 315, in connect
    await asyncio.sleep(backoff)
  File "/mnt/sw/nix/store/db63z7j5w4n84c625pv5b57m699bnbws-python-3.8.12-view/lib/python3.8/asyncio/tasks.py", line 659, in sleep
    return await future
asyncio.exceptions.CancelledError
2022-10-27 19:20:21,885 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
Traceback (most recent call last):
  File "./repro_commclosed.py", line 17, in <module>
    main()
  File "./repro_commclosed.py", line 10, in main
    while len(client.scheduler_info()['workers']) < 1:
KeyError: 'workers'
srun: error: worker1133: task 2: Exited with exit code 1

I think all those errors are spurious, though. If I don’t let the scheduler and worker ranks fall through to start a new client, then I get no errors. That is, main() looks like this:

import dask_mpi
from distributed import Client


def main():
    # exit=False: the scheduler/worker ranks return from initialize()
    # instead of calling sys.exit() when the cluster shuts down.
    dask_mpi.initialize(exit=False)

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    if rank != 1:
        # Only rank 1 runs the client code; the other ranks stop here rather
        # than falling through to create a second client.
        return

    with Client() as client:
        ...
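
(With dask-mpi’s default layout, rank 0 hosts the scheduler, rank 1 runs the client script, and the remaining ranks run workers. The rank guard presumably keeps the scheduler and worker ranks from falling through into the with Client() block after the cluster is torn down, which is what produced the second client connection and the failed-reconnect traceback in the log above.)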

Then I get this log with no errors:

Log
(venv8) lgarrison@scclin021:~/scc/daskdistrib$ srun -n3 -p scc python ./repro_commclosed.py 
srun: job 1923390 queued and waiting for resources
srun: job 1923390 has been allocated resources
2022-10-27 19:28:25,824 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-10-27 19:28:25,900 - distributed.scheduler - INFO - State start
2022-10-27 19:28:25,908 - distributed.scheduler - INFO -   Scheduler at: tcp://10.128.145.70:45219
2022-10-27 19:28:25,908 - distributed.scheduler - INFO -   dashboard at:                     :8787
2022-10-27 19:28:25,957 - distributed.worker - INFO -       Start worker at:  tcp://10.128.145.70:43259
2022-10-27 19:28:25,958 - distributed.worker - INFO -          Listening to:  tcp://10.128.145.70:43259
2022-10-27 19:28:25,958 - distributed.worker - INFO -           Worker name:                          2
2022-10-27 19:28:25,958 - distributed.worker - INFO -          dashboard at:        10.128.145.70:43161
2022-10-27 19:28:25,958 - distributed.worker - INFO - Waiting to connect to:  tcp://10.128.145.70:45219
2022-10-27 19:28:25,958 - distributed.worker - INFO - -------------------------------------------------
2022-10-27 19:28:25,958 - distributed.worker - INFO -               Threads:                          1
2022-10-27 19:28:25,958 - distributed.worker - INFO -                Memory:                  17.86 GiB
2022-10-27 19:28:25,958 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-3amduzbz
2022-10-27 19:28:25,958 - distributed.worker - INFO - -------------------------------------------------
2022-10-27 19:28:26,914 - distributed.scheduler - INFO - Receive client connection: Client-0e353432-564f-11ed-8f09-7cd30ac60c1e
2022-10-27 19:28:26,917 - distributed.core - INFO - Starting established connection
2022-10-27 19:28:26,944 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://10.128.145.70:43259', name: 2, status: init, memory: 0, processing: 0>
2022-10-27 19:28:26,944 - distributed.scheduler - INFO - Starting worker compute stream, tcp://10.128.145.70:43259
2022-10-27 19:28:26,945 - distributed.core - INFO - Starting established connection
2022-10-27 19:28:26,945 - distributed.worker - INFO -         Registered to:  tcp://10.128.145.70:45219
2022-10-27 19:28:26,945 - distributed.worker - INFO - -------------------------------------------------
2022-10-27 19:28:26,946 - distributed.core - INFO - Starting established connection
2022-10-27 19:28:27,923 - distributed.scheduler - INFO - Retiring worker tcp://10.128.145.70:43259
2022-10-27 19:28:27,923 - distributed.active_memory_manager - INFO - Retiring worker tcp://10.128.145.70:43259; no unique keys need to be moved away.
2022-10-27 19:28:27,924 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.128.145.70:43259', name: 2, status: closing_gracefully, memory: 0, processing: 0>
2022-10-27 19:28:27,924 - distributed.core - INFO - Removing comms to tcp://10.128.145.70:43259
2022-10-27 19:28:27,924 - distributed.scheduler - INFO - Lost all workers
2022-10-27 19:28:27,924 - distributed.scheduler - INFO - Retired worker tcp://10.128.145.70:43259
2022-10-27 19:28:27,930 - distributed.worker - INFO - Stopping worker at tcp://10.128.145.70:43259
2022-10-27 19:28:27,930 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-ea76a657-c3a1-47cb-9ee3-dd77bed84fb7 Address tcp://10.128.145.70:43259 Status: Status.closing
2022-10-27 19:28:28,926 - distributed.scheduler - INFO - Scheduler closing...
2022-10-27 19:28:28,927 - distributed.scheduler - INFO - Scheduler closing all comms
lgarrison commented, Oct 28, 2022

Does this provide a workaround for you until I can do all of that?

Yes, I can use this workaround. Thanks!

