distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
I have 2 machines: a worker machine and a scheduler machine.
The worker machine is CentOS 7 with Python 3.7 and dask-distributed 2.5.2.
The scheduler machine has a Docker container running. The container has the same versions of Python and Dask, and incidentally it is also a CentOS 7 image.
I start the scheduler's Docker container with this docker-compose YAML:
version: '3.7'
services:
  service1:
    image: ...
    container_name: ...
    network_mode: bridge
    env_file:
      - ~/.env
    ports:
      - "8888:8888"
      - "9796:9796"
      - "9797:9797"
    volumes:
      ...
    command: jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
(Notice I’m mapping the two ports needed for a scheduler to run.)
When I start up a dask-worker on the worker box and a dask-scheduler in the Docker container, everything seems to initiate correctly, except that after a little while I get this error:
[root@510b0c5af190 web]# my_project.py run distributed_workflow
Traceback (most recent call last):
  File "/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 237, in write
    stream.write(b)
  File "/conda/lib/python3.7/site-packages/tornado/iostream.py", line 546, in write
    self._check_closed()
  File "/conda/lib/python3.7/site-packages/tornado/iostream.py", line 1009, in _check_closed
    raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/conda/bin/sql_server", line 11, in <module>
    load_entry_point('sql-server', 'console_scripts', 'sql_server')()
  File "/conda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/conda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/app/sql_server/cli/sql_server.py", line 28, in run
    daily(True)
  File "/app/sql_server/cli/run/daily.py", line 166, in daily
    verbose=True,
  File "/wcf/spartans/spartans/spartans.py", line 116, in __enter__
    self.open()
  File "/wcf/spartans/spartans/spartans.py", line 123, in open
    self._start_agents()
  File "/wcf/spartans/spartans/spartans.py", line 179, in _start_agents
    set_as_default=False)
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 715, in __init__
    self.start(timeout=timeout)
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 880, in start
    sync(self.loop, self._start, **kwargs)
  File "/conda/lib/python3.7/site-packages/distributed/utils.py", line 333, in sync
    raise exc.with_traceback(tb)
  File "/conda/lib/python3.7/site-packages/distributed/utils.py", line 317, in f
    result[0] = yield future
  File "/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 973, in _start
    await self._ensure_connected(timeout=timeout)
  File "/conda/lib/python3.7/site-packages/distributed/client.py", line 1040, in _ensure_connected
    {"op": "register-client", "client": self.id, "reply": False}
  File "/conda/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 254, in write
    convert_stream_closed_error(self, e)
  File "/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 132, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
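For what it's worth, the call that blows up is essentially just constructing a distributed Client against the scheduler. A minimal sketch of the same connection (the address is from my setup; the timeout value here is illustrative, not the exact one my code uses):

# Minimal sketch of the failing connection. set_as_default=False matches
# the spartans.py frame in the traceback; the timeout is illustrative.
from distributed import Client

client = Client("tcp://scheduler.myco.com:9796",
                timeout=10,
                set_as_default=False)
print(client.scheduler_info())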
So I investigate the logs. The worker log looks like this:
(base) [worker@worker-03 tmp]$ cat worker_output.txt
distributed.nanny - INFO - Start Nanny at: 'tcp://10.1.25.3:43111'
distributed.diskutils - INFO - Found stale lock file and directory '/home/worker/worker-h8jhtnng', purging
distributed.worker - INFO - Start worker at: tcp://10.1.25.3:42739
distributed.worker - INFO - Listening to: tcp://10.1.25.3:42739
distributed.worker - INFO - dashboard at: 10.1.25.3:40970
distributed.worker - INFO - Waiting to connect to: tcp://scheduler.myco.com:9796
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 33.53 GB
distributed.worker - INFO - Local Directory: /home/worker/worker-mf3marrd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://scheduler.myco.com:9796
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
and my scheduler log (inside my docker container, on the scheduler.myco.com machine) looks like this:
[root@510b0c5af190 web]# cat /tmp/worker_output.txt
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-lq7oa5sc
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.17.0.2:9796
distributed.scheduler - INFO - dashboard at: :9797
distributed.scheduler - INFO - Register tcp://10.1.25.3:42739
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.1.25.3:42739
distributed.core - INFO - Starting established connection
Now, there are no errors in the logs. Indeed, when I inspect the running processes I see this:
Worker machine:
worker 8974 1.6 0.1 394088 45044 ? Sl 16:04 0:14 /data/anaconda3/bin/python /data/anaconda3/bin/dask-worker scheduler.myco.com:9796 --dashboard-prefix my_workflow
Scheduler container:
root 33 1.4 1.4 670884 115056 pts/0 Sl 16:04 0:15 /conda/bin/python /conda/bin/dask-scheduler --port 9796 --dashboard-address 172.17.0.2:9797 --dashboard-prefix my_workflow
Notice that 172.17.0.2 is the address inside the scheduler container. If I instead try to start the scheduler using the host machine's hostname as the dashboard address, I get [Errno 99] Cannot assign requested address, presumably because the host's address cannot be bound from inside the container.
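To rule out a plain reachability problem, a quick throwaway check like this (my own sketch, not part of the project) can confirm whether the mapped scheduler ports answer from the worker box:

# Throwaway reachability check (my sketch): can this machine open a TCP
# connection to the ports that Docker maps into the scheduler container?
import socket

def can_connect(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"cannot reach {host}:{port}: {exc}")
        return False

print(can_connect("scheduler.myco.com", 9796))  # scheduler comm port
print(can_connect("scheduler.myco.com", 9797))  # dashboard port

If these fail, the problem is the port mapping rather than Dask itself.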
Anyway, these processes are still running, yet as far as I can tell they are not working on the workflow I tried to pass to the worker. Can you please help me understand what I'm doing wrong to produce the error above?
Top GitHub Comments
I too am facing this issue, and it doesn't go away with dask/distributed 2.20.
I created a reproducible example. It isn't minimal, but it is fairly compact. Based on the stack trace, the error seems to take a different code path than what previous commenters have reported.
The first run is usually successful, but executing it a second time immediately afterwards leads to the error. I have no Dask config files.
I cannot reproduce it with pure=False, and I also cannot reproduce it when setting n_workers=2 with no adapt().
Typical output
edit: Python 3.8.6, dask 2020.12.0, distributed 2021.01.1, tornado 6.1
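For context, this is roughly the shape of setup those knobs refer to (a hypothetical sketch, not my actual reproducer): an adaptive LocalCluster plus default pure=True submissions.

# Hypothetical sketch of the kind of setup described above, NOT the
# actual reproducer: adaptive scaling plus default (pure=True) task
# submission. Using n_workers=2 with no adapt(), or passing pure=False,
# avoided the error for me.
from distributed import Client, LocalCluster

def inc(x):
    return x + 1

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=0, threads_per_worker=1)
    cluster.adapt(minimum=0, maximum=4)    # adaptive scaling in play
    client = Client(cluster)
    futures = client.map(inc, range(100))  # pure defaults to True
    print(sum(client.gather(futures)))
    client.close()
    cluster.close()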