GCPCluster timeout on exit
I received this error after creating a cluster, running jobs on it for ~5 hours, and then trying to exit the process.
As a result, I had to delete all of the cluster's instances myself.
# create cluster, let jobs run, and then ~5 hours later:
In [3]: exit
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb69c986760>>, <Task finished name='Task-130' coro=<SpecCluster._correct_state_internal() done, defined at /home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py:320> exception=OSError('Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s')>)
Traceback (most recent call last):
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 401, in _close
await self._correct_state()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
comm = await self.live_comm()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 768, in live_comm
comm = await connect(
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
raise IOError(
OSError: Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 641, in close_clusters
cluster.close(timeout=10)
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 104, in close
return self.sync(self._close, callback_timeout=timeout)
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 183, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
result[0] = yield future
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 401, in _close
await self._correct_state()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
comm = await self.live_comm()
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 768, in live_comm
comm = await connect(
File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
raise IOError(
OSError: Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s
This is similar to https://github.com/dask/dask-cloudprovider/issues/179 but the timeout here occurs when trying to communicate with the cluster, not when using the GCP API.
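Both tracebacks above end the same way: an `asyncio` timeout while waiting for the handshake is re-raised as an `OSError`, which is why each shows "The above exception was the direct cause of the following exception". A minimal sketch of that pattern, using only the standard library, might look like the following. The `read_handshake`, `connect`, and `try_connect` functions are hypothetical stand-ins written for illustration, not the `distributed` implementation, and the short timeout is chosen only so the example finishes quickly.

```python
import asyncio


async def read_handshake():
    # Hypothetical stand-in for comm.read(): a peer that never answers.
    await asyncio.sleep(3600)


async def connect(address, timeout):
    # Simplified illustration (an assumption, not the library's code):
    # wait for the handshake and convert a timeout into an OSError,
    # chaining the original exception as the direct cause.
    try:
        return await asyncio.wait_for(read_handshake(), timeout)
    except asyncio.TimeoutError as exc:
        raise OSError(
            f"Timed out during handshake while connecting to {address} "
            f"after {timeout} s"
        ) from exc


def try_connect(address, timeout=0.1):
    # Run the coroutine to completion and hand back the error, if any,
    # so the caller can decide how to clean up.
    try:
        asyncio.run(connect(address, timeout))
    except OSError as err:
        return err
    return None


err = try_connect("tcp://10.142.0.3:8786")
print(type(err).__name__, "caused by", type(err.__cause__).__name__)
```

In the report, this error surfaces while the cluster is closing, so the shutdown path never gets to remove the instances, which would be consistent with having to delete them by hand.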
Issue Analytics
- State:
- Created 3 years ago
- Comments: 13 (11 by maintainers)
Top GitHub Comments
@rpanai could you please raise a separate issue, as FargateCluster and GCPCluster are two separate cluster manager implementations.
No, you're fine.