Timed out trying to connect to host after 10 s
I have a dask distributed cluster up and running on 40 workers:
dask_client = Client('localhost:8786')
dask_client.restart()
dask_client
I’ve restarted everything so no tasks are queued and the scheduler log shows:
distributed.scheduler - INFO - Clear task state
I have a large csr sparse matrix that I am scattering to the cluster:
csr_future = dask_client.scatter(csr, broadcast=True)
After a few seconds, I see:
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.core - INFO - Removing comms to tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:38051
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:46414
distributed.core - INFO - Starting established connection
So it looks like some workers are being removed and new workers are registering to replace them. About 30 seconds later, I see multiple tornado errors:
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
result_list.append(f.result())
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
connection_args=self.connection_args,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e39007c88>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
result = yield result
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2496, in scatter
yield self.replicate(keys=keys, workers=workers, n=n)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2903, in replicate
for w, who_has in gathers.items()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
result_list.append(f.result())
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
connection_args=self.connection_args,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
It looks like the timeout/connection refused errors refer to the same IP address and ports that appeared in the "Removing comms" messages earlier. I can’t seem to resolve this.
In case it matters, I am running these commands in jupyterlab=0.35.5, which runs next to the dask-scheduler, and we are running tornado=6.0.2 with dask=1.2.2.
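One way to check that observation rather than eyeballing it is to extract the addresses from the log and intersect them. This is only a sketch: the log excerpt is copied from the output above (tails trimmed), and the regexes and the `stale_targets` name are my own assumptions, not part of distributed.

```python
import re

# Excerpt of the scheduler output quoted above (tails of the error lines trimmed).
LOG = """\
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s
"""

# Assumed patterns, based only on the message formats shown in this issue.
REMOVED = re.compile(r"Remove worker (tcp://\S+)")
TIMED_OUT = re.compile(r"Timed out trying to connect to '(tcp://[^']+)'")


def stale_targets(log_text):
    """Return addresses that were removed and later appear in a connect timeout."""
    removed = set(REMOVED.findall(log_text))
    timed_out = set(TIMED_OUT.findall(log_text))
    return sorted(removed & timed_out)


print(stale_targets(LOG))
```

For the excerpt above, both timed-out addresses are ones the scheduler had already removed, which is consistent with the scheduler still trying to replicate data to workers that no longer exist.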
Top GitHub Comments
Seems like this fixes it for me locally.
I start a scheduler with dask-scheduler, connect a worker, then kill the scheduler. Going to write up a test now and make a PR.
cc @scottyhq
I am using a local cluster and got the same error. Commenting this change out worked for me. As an alternative, could you advise on how to change the “distributed.comm.timeouts.connect” setting?
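On the question of changing that setting: dask reads configuration from DASK_* environment variables, where a double underscore stands for a level of nesting. A minimal sketch of that route is below. The `dask_env_var` helper is hypothetical (written for this example only) and "30s" is an arbitrary value, not a recommendation.

```python
import os


def dask_env_var(config_key: str) -> str:
    """Hypothetical helper: map a dotted dask config key to its DASK_* variable name."""
    return "DASK_" + config_key.replace(".", "__").replace("-", "_").upper()


# Assumption: this runs before dask/distributed is first imported in the process,
# because dask collects DASK_* variables when it is imported.
name = dask_env_var("distributed.comm.timeouts.connect")
os.environ[name] = "30s"  # arbitrary example value

print(name)
```

Inside an already running session, dask.config.set({"distributed.comm.timeouts.connect": "30s"}) changes the same key for that process. The traceback in this issue is raised from scheduler.py, so the value presumably has to be in effect in the scheduler process. With a local cluster, that means setting it before the cluster is created. Since the underlying error here is a refused connection to a worker that was already removed, a longer timeout may only delay the failure rather than fix it.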