Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed

See original GitHub issue

I have 2 machines: a worker machine and a scheduler machine.

The worker machine runs CentOS 7 with Python 3.7 and dask distributed 2.5.2.

The scheduler machine runs a Docker container with the same versions of Python and dask; incidentally, it is also a CentOS 7 image.

I start the scheduler's Docker container with this docker-compose YAML:

    version: '3.7'
    services:
      service1:
        image: ...
        container_name: ...
        network_mode: bridge
        env_file:
          - ~/.env
        ports:
          - "8888:8888"
          - "9796:9796"
          - "9797:9797"
        volumes:
          ...
        command: jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

(Notice I’m mapping the two ports needed for a scheduler to run.)
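
As a quick sanity check that the published ports are actually reachable from the worker machine, something like the following could be run there before starting the worker (this snippet is illustrative and not part of the original setup; the hostname and ports are the ones used below):

    # Illustrative connectivity check (not from the original post): confirm from
    # the worker machine that the ports published by docker-compose are reachable.
    import socket

    for port in (9796, 9797):
        with socket.create_connection(("scheduler.myco.com", port), timeout=5):
            print(f"port {port} on scheduler.myco.com is reachable")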

When I start a dask-worker on the worker box and a dask-scheduler in the Docker container, everything seems to initiate correctly, except that after a little while I get this error:

[root@510b0c5af190 web]# my_project.py run distributed_workflow
Traceback (most recent call last):
  File "/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 237, in write
    stream.write(b)
  File "/conda/lib/python3.7/site-packages/tornado/iostream.py", line 546, in write
    self._check_closed()
  File "/conda/lib/python3.7/site-packages/tornado/iostream.py", line 1009, in _check_closed
    raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/conda/bin/sql_server", line 11, in <module>
    load_entry_point('sql-server', 'console_scripts', 'sql_server')()
  File "/conda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/conda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/app/sql_server/cli/sql_server.py", line 28, in run
    daily(True)
  File "/app/sql_server/cli/run/daily.py", line 166, in daily
    verbose=True,
  File "/wcf/spartans/spartans/spartans.py", line 116, in __enter__
    self.open()
  File "/wcf/spartans/spartans/spartans.py", line 123, in open
    self._start_agents()
  File "/wcf/spartans/spartans/spartans.py", line 179, in _start_agents
    set_as_default=False)
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 715, in __init__
    self.start(timeout=timeout)
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 880, in start
    sync(self.loop, self._start, **kwargs)
  File "/conda/lib/python3.7/site-packages/distributed/utils.py", line 333, in sync
    raise exc.with_traceback(tb)
  File "/conda/lib/python3.7/site-packages/distributed/utils.py", line 317, in f
    result[0] = yield future
  File "/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 973, in _start
    await self._ensure_connected(timeout=timeout)
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 1040, in _ensure_connected
    {"op": "register-client", "client": self.id, "reply": False}
  File "/conda/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 254, in write
    convert_stream_closed_error(self, e)
  File "/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 132, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
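
For context, the call that actually fails is the Client construction inside the application code (the spartans frames in the traceback above). Stripped of the surrounding code, it is roughly equivalent to this sketch; the timeout value is a placeholder, while the address and set_as_default=False come from the worker log and the traceback:

    # Rough sketch of the client-side step that raises CommClosedError above.
    # The timeout value is a placeholder, not taken from the traceback.
    from distributed import Client

    client = Client("tcp://scheduler.myco.com:9796", set_as_default=False, timeout=10)
    print(client.scheduler_info())
    client.close()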

So I investigate the logs. The worker log looks like this:

(base) [worker@worker-03 tmp]$ cat worker_output.txt
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.1.25.3:43111'
distributed.diskutils - INFO - Found stale lock file and directory '/home/worker/worker-h8jhtnng', purging
distributed.worker - INFO -       Start worker at:      tcp://10.1.25.3:42739
distributed.worker - INFO -          Listening to:      tcp://10.1.25.3:42739
distributed.worker - INFO -          dashboard at:            10.1.25.3:40970
distributed.worker - INFO - Waiting to connect to:   tcp://scheduler.myco.com:9796
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                   33.53 GB
distributed.worker - INFO -       Local Directory: /home/worker/worker-mf3marrd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://scheduler.myco.com:9796
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

and my scheduler log (inside my docker container, on the scheduler.myco.com machine) looks like this:

[root@510b0c5af190 web]# cat /tmp/worker_output.txt 
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-lq7oa5sc
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://172.17.0.2:9796
distributed.scheduler - INFO -   dashboard at:                     :9797
distributed.scheduler - INFO - Register tcp://10.1.25.3:42739
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.1.25.3:42739
distributed.core - INFO - Starting established connection

Now, there are no errors in the logs. Indeed, when I inspect the running processes, I see this:

Worker machine:

worker  8974  1.6  0.1 394088 45044 ?        Sl   16:04   0:14 /data/anaconda3/bin/python /data/anaconda3/bin/dask-worker scheduler.myco.com:9796 --dashboard-prefix my_workflow

Scheduler container:

root        33  1.4  1.4 670884 115056 pts/0   Sl   16:04   0:15 /conda/bin/python /conda/bin/dask-scheduler --port 9796 --dashboard-address 172.17.0.2:9797 --dashboard-prefix my_workflow

Notice that 172.17.0.2 is the address inside the scheduler container. If I instead try to start the scheduler using the host machine's hostname as the address, I get the error [Errno 99] Cannot assign requested address, presumably because port 9797 is already taken by the Docker container.
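
For illustration only (this snippet is not part of the original setup): [Errno 99] corresponds to EADDRNOTAVAIL, which bind() raises when the requested address is not assigned to any interface in the current network namespace. The addresses actually available inside the container can be listed like this:

    # Illustration only: list the addresses available inside the container.
    # Binding the dashboard to an address outside this set raises
    # "[Errno 99] Cannot assign requested address" (EADDRNOTAVAIL).
    import socket

    hostname = socket.gethostname()
    print("container hostname:", hostname)
    print("container addresses:", socket.gethostbyname_ex(hostname)[2])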

Anyway, these processes are still running, yet as far as I can tell they're not working on the workflow I tried to pass to the worker. Can you please help me understand what I'm doing wrong to produce the error above?

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

5 reactions
navjotk commented, Jul 7, 2020

I too am facing this issue, and it won't go away with dask/distributed 2.20.

1 reaction
twoertwein commented, Jan 24, 2021

I created a reproducible example. It isn't minimal, but it is fairly compact. Based on the stack trace, the error seems to take a different code path than what previous people have reported.

The first run is usually successful, but executing it a second time immediately afterwards leads to the error. I have no dask config files.

I cannot reproduce it with pure=False, and I also cannot reproduce it when setting n_workers=2 with no adapt() (a sketch of that variant follows the output below).

import random
import string
from pathlib import Path

import dask
from dask.distributed import Client, LocalCluster


def dummy(*args, **kwargs):
    import time

    time.sleep(10)
    return True


def test():
    # pure=True matters here: the error does not reproduce with pure=False.
    delayed_function = dask.delayed(dummy, pure=True)
    targets = [
        delayed_function(delayed_function),
        delayed_function(delayed_function),
        delayed_function(delayed_function, delayed_function),
    ]

    random_suffix = "".join(random.choices(string.ascii_letters + string.digits, k=10))
    with LocalCluster(
        local_directory=Path(".") / f"dask_{random_suffix}",
        threads_per_worker=1,
        processes=True,
        n_workers=1,
    ) as cluster:
        # Adaptive scaling also matters: with a fixed n_workers=2 and no
        # adapt() call, the error does not reproduce.
        cluster.adapt(minimum=1, maximum=2)
        with Client(cluster) as client:
            print(dask.compute(*targets, scheduler=client))


if __name__ == "__main__":
    test()

Typical output

$ python test_tcp.py
(True, True, True)
Task was destroyed but it is pending!
task: <Task pending name='Task-152' coro=<AdaptiveCore.adapt() done, defined at /pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f7c6f4a75b0>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/tornado/ioloop.py:688]>


$ python test_tcp.py
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f475e2d68e0>>, <Task finished name='Task-65' coro=<Worker.close() done, defined at /pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/worker.py:1169> exception=CommClosedError('in <closed TCP>: Stream is closed')>)
Traceback (most recent call last):
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/comm/tcp.py", line 187, in read
    n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/worker.py", line 1193, in close
    await r.close_gracefully()
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/core.py", line 858, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/core.py", line 641, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/comm/tcp.py", line 202, in read
    convert_stream_closed_error(self, e)
  File "/pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
(True, True, True)
Task was destroyed but it is pending!
task: <Task pending name='Task-152' coro=<AdaptiveCore.adapt() done, defined at /pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f0c8548f070>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /pool01/home/twoertwe/.cache/pypoetry/virtualenvs/panam-zGchuJUy-py3.8/lib/python3.8/site-packages/tornado/ioloop.py:688]>

edit: Python 3.8.6, dask 2020.12.0, distributed 2021.01.1, tornado 6.1
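
For reference, here is a sketch of the variant mentioned above that does not reproduce the error: the same reproducer, but with a fixed pool of two workers and no adapt() call (only those two details differ; everything else mirrors the example, and this sketch is not part of the original comment).

    # Sketch of the non-failing variant described above: same reproducer, but
    # with n_workers=2 and no cluster.adapt() call.
    import dask
    from dask.distributed import Client, LocalCluster


    def dummy(*args, **kwargs):
        import time

        time.sleep(10)
        return True


    def test():
        delayed_function = dask.delayed(dummy, pure=True)
        targets = [
            delayed_function(delayed_function),
            delayed_function(delayed_function),
            delayed_function(delayed_function, delayed_function),
        ]

        # Fixed worker pool, no adaptive scaling: this combination was reported
        # not to trigger the CommClosedError.
        with LocalCluster(threads_per_worker=1, processes=True, n_workers=2) as cluster:
            with Client(cluster) as client:
                print(dask.compute(*targets, scheduler=client))


    if __name__ == "__main__":
        test()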

Read more comments on GitHub >

Top Results From Across the Web

how to debug a CommClosedError in Dask Gateway deployed ...
I created this as a case yesterday (github.com/dask/dask-gateway/issues/345) and noticed that you encountered the same issue as me. – Riley Hun.
Read more >
Source code for distributed.comm.core
This will attempt to flush outgoing buffers before actually closing the underlying transport. This method returns a coroutine. """ [docs] @abstractmethod def ...
Read more >
Getting heartbeats from unknown workers - Dask Forum
2022-07-20 10:39:51,552 - distributed.worker - ERROR - Scheduler was ... CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker ...
Read more >
dask/dask - Gitter
... in convert_stream_closed_error raise CommClosedError("in %s: %s" % (obj, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed.
Read more >
Simple demonstration of the use of Dask/rsexecute - GitLab
from dask.distributed import Client, LocalCluster try: client = Client('scheduler:8786', timeout=10) ... CommClosedError: in <TCP (closed) Client->Scheduler ...
Read more >
