
FargateCluster container: scheduler exited unexpectedly!


This works fine (takes about 2 minutes):

from dask_cloudprovider import FargateCluster
cluster = FargateCluster(n_workers=1, image='rsignell/pangeo-worker:2020-01-23c')

But when I added numba, holoviews, and datashader to the container environment and tried again:

from dask_cloudprovider import FargateCluster
cluster = FargateCluster(n_workers=1, image='rsignell/pangeo-worker:2020-01-28')

I get:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in __init__(self, **kwargs)
   1099 
   1100     def __init__(self, **kwargs):
-> 1101         super().__init__(fargate_scheduler=True, fargate_workers=True, **kwargs)
   1102 
   1103 

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in __init__(self, fargate_scheduler, fargate_workers, image, scheduler_cpu, scheduler_mem, scheduler_timeout, worker_cpu, worker_mem, worker_gpu, n_workers, cluster_arn, cluster_name_template, execution_role_arn, task_role_arn, task_role_policies, cloudwatch_logs_group, cloudwatch_logs_stream_prefix, cloudwatch_logs_default_retention, vpc, subnets, security_groups, environment, tags, find_address_timeout, skip_cleanup, aws_access_key_id, aws_secret_access_key, region_name, **kwargs)
    593         self._region_name = region_name
    594         self._lock = asyncio.Lock()
--> 595         super().__init__(**kwargs)
    596 
    597     async def _start(self,):

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name)
    254         if not self.asynchronous:
    255             self._loop_runner.start()
--> 256             self.sync(self._start)
    257             self.sync(self._correct_state)
    258 

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    160             return future
    161         else:
--> 162             return sync(self.loop, func, *args, **kwargs)
    163 
    164     async def _logs(self, scheduler=True, workers=True):

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    343     if error[0]:
    344         typ, exc, tb = error[0]
--> 345         raise exc.with_traceback(tb)
    346     else:
    347         return result[0]

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/utils.py in f()
    327             if callback_timeout is not None:
    328                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 329             result[0] = yield future
    330         except Exception as exc:
    331             error[0] = sys.exc_info()

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in _start(self)
    765             "Hang tight! ",
    766         ):
--> 767             await super()._start()
    768 
    769     @property

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/deploy/spec.py in _start(self)
    282 
    283         self.status = "starting"
--> 284         self.scheduler = await self.scheduler
    285         self.scheduler_comm = rpc(
    286             getattr(self.scheduler, "external_address", None) or self.scheduler.address,

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in _()
    128             async with self.lock:
    129                 if not self.task:
--> 130                     await self.start()
    131                     assert self.task
    132             return self

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in start(self)
    258             self.public_ip = interface["Association"]["PublicIp"]
    259         self.private_ip = interface["PrivateIpAddresses"][0]["PrivateIpAddress"]
--> 260         await self._set_address_from_logs()
    261         self.status = "running"
    262 

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in _set_address_from_logs(self)
    181             else:
    182                 if not await self._task_is_running():
--> 183                     raise RuntimeError("%s exited unexpectedly!" % type(self).__name__)
    184                 continue
    185             break

RuntimeError: Scheduler exited unexpectedly!
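For context on where this error is raised: `_set_address_from_logs` watches the scheduler task's CloudWatch logs for the line announcing its listening address, and raises if the task stops before that line ever appears. The sketch below illustrates that pattern; it is not dask-cloudprovider's actual code, and the function name is made up, though `dask-scheduler` really does log a `Scheduler at:` line.

```python
# Sketch (not the library's exact code) of the polling check in
# _set_address_from_logs that produces "Scheduler exited unexpectedly!".

def find_scheduler_address(log_lines, task_is_running):
    """Scan log lines for the scheduler's announced address.

    Returns the address if found, None if the caller should keep polling,
    and raises RuntimeError if the task has already stopped without ever
    logging an address -- the situation in the traceback above.
    """
    for line in log_lines:
        # dask-scheduler logs e.g. "Scheduler at: tcp://10.0.1.23:8786"
        if "Scheduler at:" in line:
            return line.split("Scheduler at:")[-1].strip()
    if not task_is_running:
        raise RuntimeError("Scheduler exited unexpectedly!")
    return None  # no address yet; keep polling
```

A container that crashes on startup never logs its address, so the poll loop ends in this RuntimeError rather than in anything that points at the underlying container failure; that is why the CloudWatch logs, not the Python traceback, hold the real cause.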

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
rsignell-usgs commented, Feb 3, 2020

@jacobtomlinson , yep, the logs showed me the problem, which was a docker container error. In my description of the problem, I said I just added a few packages to a dockerfile that worked, and then it didn’t work. But when I added the packages, I added them to the wrong dockerfile. So just user error. Nothing to do with dask-cloudprovider.

0 reactions
jacobtomlinson commented, Feb 3, 2020

Sure, that sounds reasonable!
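Since the root cause turned out to be a broken container image rather than dask-cloudprovider itself, a quick local smoke test can surface this class of error before any Fargate tasks are launched. This is a hedged sketch: it assumes Docker is available locally and that the image is expected to ship the `dask-scheduler` entrypoint.

```shell
# Run the image locally, overriding the command so we only check that the
# scheduler executable exists and its imports resolve inside the image.
docker run --rm rsignell/pangeo-worker:2020-01-28 dask-scheduler --version

# If the build went to the wrong Dockerfile (as in this issue), the command
# fails here with the same container error CloudWatch would have shown.
```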

