ECSCluster does not de-provision tasks after failing to connect to scheduler

What happened:

If ECSCluster successfully runs a scheduler task but then fails to connect to the scheduler, an error is raised and the scheduler task is left active. The task has to be manually removed from the ECS cluster and deregistered.

In my case, this happened when I forgot to connect to the VPN that provides connectivity to the VPC. The task started successfully, but no network connection to the scheduler was possible.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 729, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 281, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 926, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 314, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 746, in live_comm
    comm = await connect(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://10.53.13.110:8786 after 10 s
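Until the cluster cleans up after itself, the manual removal described above can be scripted. The following is only a rough sketch: the helper names, the `cluster` and `family` arguments, and the default `reason` string are illustrative assumptions, and `ecs` is assumed to behave like `boto3.client("ecs")`. Only the first page of results from each list call is handled.

```python
def stop_orphaned_tasks(ecs, cluster, family, reason="Orphaned Dask scheduler"):
    """Stop running tasks of `family` in `cluster` and return the stopped ARNs.

    `ecs` is assumed to behave like boto3.client("ecs"). Only the first page
    of list_tasks results is processed in this sketch.
    """
    response = ecs.list_tasks(cluster=cluster, family=family, desiredStatus="RUNNING")
    stopped = []
    for task_arn in response.get("taskArns", []):
        ecs.stop_task(cluster=cluster, task=task_arn, reason=reason)
        stopped.append(task_arn)
    return stopped


def deregister_family(ecs, family_prefix):
    """Deregister ACTIVE task definitions matching `family_prefix`; return their ARNs.

    Caution: this touches every matching definition, so choose a prefix that
    is specific to the leftover scheduler (the prefix here is hypothetical).
    """
    response = ecs.list_task_definitions(familyPrefix=family_prefix, status="ACTIVE")
    removed = []
    for definition_arn in response.get("taskDefinitionArns", []):
        ecs.deregister_task_definition(taskDefinition=definition_arn)
        removed.append(definition_arn)
    return removed


# Example call (requires AWS credentials; all values are placeholders):
#   import boto3
#   ecs = boto3.client("ecs")
#   stop_orphaned_tasks(ecs, "my-cluster-arn", "my-scheduler-family")
#   deregister_family(ecs, "my-scheduler-family")
```

Passing the client in as an argument keeps the helpers easy to dry-run against a stub before pointing them at a real account.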

What you expected to happen:

If ECSCluster fails to connect to the running scheduler, it should catch the error and clean up the scheduler task.
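A minimal sketch of the expected behaviour is shown below. It is not the dask-cloudprovider implementation: `FakeSchedulerTask`, `connect`, `start_cluster`, and `demo` are hypothetical stand-ins, and the connection is hard-coded to fail so that the cleanup path runs.

```python
import asyncio


class FakeSchedulerTask:
    """Hypothetical stand-in for a launched ECS scheduler task."""

    def __init__(self):
        self.running = False

    async def start(self):
        self.running = True

    async def close(self):
        self.running = False


async def connect(address, timeout):
    """Stand-in for the scheduler connection; it always fails in this sketch."""
    await asyncio.sleep(0)
    raise OSError(f"Timed out trying to connect to {address} after {timeout} s")


async def start_cluster(task, address, timeout=10):
    """Start the task, then connect; tear the task down if connecting fails."""
    await task.start()
    try:
        await connect(address, timeout)
    except OSError:
        await task.close()  # clean up before surfacing the original error
        raise


def demo():
    """Run the failing startup and report whether the task was left running."""
    task = FakeSchedulerTask()
    try:
        asyncio.run(start_cluster(task, "tcp://10.53.13.110:8786"))
    except OSError as exc:
        return task.running, str(exc)
    return task.running, None
```

Re-raising after cleanup keeps the original error visible to the caller, so the user still sees the timeout but no task is left behind.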

Minimal Complete Verifiable Example:

Create an ECSCluster instance with tasks in an unreachable network.

Anything else we need to know?:

None.

Environment:

  • Dask version: 2021.2.0
  • Python version: 3.8.6
  • Operating System: Debian (python:3.8 Docker image)
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
manuelreyesgomez commented, Mar 15, 2021

@jacobtomlinson It seemed to have worked, thanks

1 reaction
jacobtomlinson commented, Mar 15, 2021

@manuelreyesgomez I’ve seen "struct.error: unpack_from requires a buffer of at least 8 bytes" a bunch recently, and it is usually due to mismatched Dask versions.

Please check your local dask/distributed versions match those in the container you are using.
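One way to run such a check is sketched below, using only the standard library. The helper names are illustrative assumptions, and the container's versions would need to be collected separately, for example by running `installed_versions()` inside the image.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("dask", "distributed")):
    """Return {package: version string, or None if not installed} for this environment."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(local, remote):
    """Return {package: (local, remote)} for every package whose versions differ."""
    return {
        name: (local.get(name), remote.get(name))
        for name in sorted(set(local) | set(remote))
        if local.get(name) != remote.get(name)
    }
```

Calling `find_mismatches(installed_versions(), container_versions)`, where `container_versions` is the dictionary gathered inside the container, returns an empty dictionary when the two environments agree.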
