ECSCluster does not de-provision tasks after failing to connect to scheduler
What happened:
If ECSCluster successfully runs a scheduler task but then fails to connect to the scheduler, an error is raised and the scheduler task is left active. The task has to be manually removed from the ECS cluster and deregistered.
In my case, this happened when I forgot to connect to a VPN network that would provide connectivity to the VPC. The task successfully ran, but no network connection to the scheduler was possible.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 729, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 281, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 926, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 314, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 746, in live_comm
    comm = await connect(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://10.53.13.110:8786 after 10 s
What you expected to happen:
If ECSCluster fails to connect to the running scheduler, it should catch the error and clean up the scheduler task.
Minimal Complete Verifiable Example:
Create an ECSCluster instance with tasks in an unreachable network.
Anything else we need to know?:
None.
Environment:
- Dask version: 2012.2.0
- Python version: 3.8.6
- Operating System: Debian (python:3.8 Docker image)
- Install method (conda, pip, source): pip
Issue Analytics
- State:
- Created 3 years ago
- Comments: 11 (4 by maintainers)
Top GitHub Comments
@jacobtomlinson It seemed to have worked, thanks
@manuelreyesgomez I’ve seen
struct.error: unpack_from requires a buffer of at least 8 bytes
a bunch recently, and it is usually due to mismatched Dask versions. Please check that your local dask/distributed versions match those in the container you are using.
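One way to compare versions is to print what is installed locally and run the same snippet inside the container image. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, and the default package list is an assumption about which packages matter here.

```python
from importlib import metadata


def installed_versions(packages=("dask", "distributed")):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Run this locally and in the container, then compare the two outputs.
print(installed_versions())
```

Once a client is connected, distributed's `Client.get_versions(check=True)` should also report mismatches between client, scheduler, and workers.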