GCPCluster timeout on exit

I received this error after creating a cluster, running jobs with it for ~5 hours, and then trying to exit out of that process.

As a result, I had to delete all of the cluster's instances myself.

# create cluster, let jobs run, and then ~5 hours later:
In [3]: exit
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb69c986760>>, <Task finished name='Task-130' coro=<SpecCluster._correct_state_internal() done, defined at /home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py:320> exception=OSError('Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s')>)
Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 768, in live_comm
    comm = await connect(
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 641, in close_clusters
    cluster.close(timeout=10)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 104, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 183, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 768, in live_comm
    comm = await connect(
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s

This is similar to https://github.com/dask/dask-cloudprovider/issues/179 but the timeout here occurs when trying to communicate with the cluster, not when using the GCP API.
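
One way to guard against orphaned instances in this situation is to wrap the close call and fall back to a separate cleanup routine when the scheduler cannot be reached. The following is a minimal sketch, not a fix from the maintainers. It assumes only that the cluster object exposes `close(timeout=...)`, as seen in the traceback. `close_with_fallback`, `fallback_cleanup`, and `UnreachableCluster` are hypothetical names introduced for illustration and are not part of dask-cloudprovider.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


def close_with_fallback(cluster, fallback_cleanup, timeout=10):
    """Try a normal close; if the scheduler is unreachable, run a fallback.

    `cluster` is any object with a `close(timeout=...)` method.
    `fallback_cleanup` is a hypothetical user-supplied callable that removes
    the cloud instances by some other route (for example, the provider's CLI).
    Returns True if the normal close succeeded, False if the fallback ran.
    """
    try:
        cluster.close(timeout=timeout)
        return True
    except (OSError, asyncio.TimeoutError) as exc:
        # The traceback above surfaces the handshake timeout as an OSError.
        logger.warning("cluster.close() failed (%s); running fallback", exc)
        fallback_cleanup()
        return False


class UnreachableCluster:
    """Stand-in for a cluster whose scheduler no longer answers."""

    def close(self, timeout=None):
        raise OSError("Timed out during handshake while connecting")


if __name__ == "__main__":
    removed = []
    ok = close_with_fallback(UnreachableCluster(), lambda: removed.append("vm"))
    print(ok, removed)
```

Calling this helper explicitly before leaving the session, rather than relying on the `atexit` hook, gives the fallback a chance to run while the process is still fully alive.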

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

1 reaction
jacobtomlinson commented, Jun 3, 2021

@rpanai could you please raise a separate issue, as FargateCluster and GCPCluster are two separate cluster manager implementations.

0 reactions
jacobtomlinson commented, Jun 4, 2021

No, you're fine.
