question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Timed out trying to connect to host after 10 s

See original GitHub issue

I have a dask distributed cluster up and running on 40 workers:

dask_client = Client('localhost:8786')
dask_client.restart()
dask_client

I’ve restarted everything so no tasks are queued and the scheduler log shows:

distributed.scheduler - INFO - Clear task state

I have a large csr sparse matrix that I am scattering to the cluster:

csr_future = dask_client.scatter(csr, broadcast=True)

After a few seconds, I see:

distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.core - INFO - Removing comms to tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:38051
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:46414
distributed.core - INFO - Starting established connection

So, it looks like some workers are being removed and new workers are being added back to replace those workers. Around 30 seconds after this, I see multiple tornado errors:

tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
    quiet_exceptions=EnvironmentError,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
    comm = yield self.pool.connect(self.addr)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
    connection_args=self.connection_args,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e39007c88>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
    quiet_exceptions=EnvironmentError,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
    result = yield result
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2496, in scatter
    yield self.replicate(keys=keys, workers=workers, n=n)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2903, in replicate
    for w, who_has in gathers.items()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
    comm = yield self.pool.connect(self.addr)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
    connection_args=self.connection_args,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused

It looks like the time out/connection refused are referring to the same ipaddress/ports where it was trying to Removing comms from earlier up above. I can’t seem to resolve this.

In case it matters, I am running these commands in a jupyterlab=0.35.5 that is running next to the dask-scheduler and we are running tornado=6.0.2 with dask=1.2.2.

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Reactions:4
  • Comments:30 (9 by maintainers)

github_iconTop GitHub Comments

5reactions
TomAugspurgercommented, Nov 19, 2019

Seems like this fixes it for me locally

diff --git a/distributed/worker.py b/distributed/worker.py
index e3ef6b26..e5666aca 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -882,6 +882,12 @@ class Worker(ServerNode):
                 )
                 self.bandwidth_workers.clear()
                 self.bandwidth_types.clear()
+            except IOError as e:
+                # Scheduler is gone. Respect distributed.comm.timeouts.connect
+                if "Timed out trying to connect" in str(e):
+                    await self.close(report=False)
+                else:
+                    raise e
             except CommClosedError:
                 logger.warning("Heartbeat to scheduler failed")
             finally:

I start a scheduler with dask-scheduler, connect a worker, then kill the scheduler. Going to write up a test now and make a PR

cc @scottyhq

1reaction
alvarouccommented, Oct 25, 2020

I am using a local cluster and got the same error. Commenting this change out worked for me. As an alternative could you advise on how to change the “distributed.comm.timeouts.connect” setting?

Read more comments on GitHub >

github_iconTop Results From Across the Web

dask/dask - Gitter
"OSError: Timed out trying to connect to gateway://10.100.3.84:32515/dhub.ac2ac55dd9534e018d808aa21fa74ce9 after 10 s". Related ip address belongs to proxy- ...
Read more >
Timed out after 10s:Sql server connection error - Get Help
I'm trying to connect metabase with sql server 12.0.4100.1 but getting this error. bellow is the error message that comes in the cmd...
Read more >
Dask tasks failing because they timed out trying to connect
OSError: Timed out trying to connect to 'inproc://192.168.xxx.xx/5050/1' after 10 s: connect() didn't finish in time.
Read more >
How to Fix the ERR_CONNECTION_TIMED_OUT Error - Kinsta
According to the WordPress support documentation, a connection timed out error appears when your website is trying to do more than your server...
Read more >
Troubleshoot connecting to your instance - AWS Documentation
Error connecting to your instance: Connection timed out · For Linux instances: Verify that there is a rule that allows traffic from your...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found