ECSCluster does not de-provision tasks after failing to connect to scheduler

What happened:

If ECSCluster successfully runs a scheduler task but then fails to connect to the scheduler, an error is raised and the scheduler task is left active. The task has to be manually removed from the ECS cluster and deregistered.

In my case, this happened when I forgot to connect to the VPN that provides connectivity to the VPC. The task started successfully, but no network connection to the scheduler was possible.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 729, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 281, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 926, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 314, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 746, in live_comm
    comm = await connect(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://10.53.13.110:8786 after 10 s
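
The manual cleanup described above (stopping the leftover task and deregistering its task definition) can be sketched with boto3. This is a rough illustration rather than what dask-cloudprovider does internally: the helper name `remove_orphaned_scheduler` and all ARNs are hypothetical placeholders, and the client is passed in so the function can be exercised without AWS access.

```python
def remove_orphaned_scheduler(ecs_client, cluster_arn, task_arn, task_definition_arn):
    """Stop a leftover scheduler task and deregister its task definition.

    `ecs_client` is expected to behave like boto3.client("ecs").
    The function name and arguments are illustrative, not part of any library.
    """
    # stop_task asks ECS to stop the running task in the given cluster.
    ecs_client.stop_task(
        cluster=cluster_arn,
        task=task_arn,
        reason="Scheduler unreachable; cleaning up manually",
    )
    # deregister_task_definition marks the task definition as inactive.
    ecs_client.deregister_task_definition(taskDefinition=task_definition_arn)


# Usage (placeholder ARNs, assumes AWS credentials are configured):
#   import boto3
#   remove_orphaned_scheduler(
#       boto3.client("ecs"),
#       cluster_arn="<cluster ARN>",
#       task_arn="<scheduler task ARN>",
#       task_definition_arn="<scheduler task definition ARN>",
#   )
```

The task and task definition ARNs have to be looked up in the ECS console or API first, since the failed constructor call does not return a cluster object to inspect.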

What you expected to happen:

If ECSCluster fails to connect to the running scheduler, it should catch the error and clean up the scheduler task.
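
A minimal sketch of the expected behavior, using stand-ins instead of real ECS calls. `FakeSchedulerTask`, `connect_to_scheduler`, and `start_cluster` are hypothetical names invented for illustration; they are not the dask-cloudprovider implementation. The point is only the pattern: if the connection step raises, close the task before re-raising.

```python
import asyncio


class FakeSchedulerTask:
    """Hypothetical stand-in for a running ECS scheduler task."""

    def __init__(self):
        self.running = False

    async def start(self):
        self.running = True

    async def close(self):
        self.running = False


async def connect_to_scheduler(address, timeout):
    """Stand-in for the connection step; it always times out in this sketch."""
    raise OSError(f"Timed out trying to connect to {address} after {timeout} s")


async def start_cluster(task, address, timeout=10):
    """Start the task, then connect; clean up the task if connecting fails."""
    await task.start()
    try:
        await connect_to_scheduler(address, timeout)
    except OSError:
        # Release the task before re-raising so nothing is left running.
        await task.close()
        raise


def demo():
    """Run the sketch and report whether the task is still running."""
    task = FakeSchedulerTask()
    try:
        asyncio.run(start_cluster(task, "tcp://10.53.13.110:8786"))
    except OSError as exc:
        return task.running, str(exc)
    return task.running, None
```

The caller still sees the original `OSError`, but by the time it propagates the task has already been closed.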

Minimal Complete Verifiable Example:

Create an ECSCluster instance with tasks in an unreachable network.

Anything else we need to know?:

None.

Environment:

Dask version: 2021.2.0
Python version: 3.8.6
Operating System: Debian (python:3.8 Docker image)
Install method (conda, pip, source): pip

Issue Analytics

Created 3 years ago
Comments: 11 (4 by maintainers)

Top GitHub Comments

manuelreyesgomez commented, Mar 15, 2021

@jacobtomlinson It seemed to have worked, thanks

jacobtomlinson commented, Mar 15, 2021

@manuelreyesgomez I’ve seen struct.error: unpack_from requires a buffer of at least 8 bytes a bunch recently, and it is usually due to mismatched Dask versions.

Please check your local dask/distributed versions match those in the container you are using.
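
One way to compare versions is to run the same small script locally and inside the container, then diff the output. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata


def installed_versions(packages=("dask", "distributed")):
    """Return a {package: version} mapping; version is None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Run this both locally and in the container image and compare the results.
    print(installed_versions())
```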