
FargateCluster container: scheduler exited unexpectedly!


This works fine (takes about 2 minutes):

from dask_cloudprovider import FargateCluster
cluster = FargateCluster(n_workers=1, image='rsignell/pangeo-worker:2020-01-23c')

But when I added numba, holoviews, and datashader to the container environment and tried again:

from dask_cloudprovider import FargateCluster
cluster = FargateCluster(n_workers=1, image='rsignell/pangeo-worker:2020-01-28')

I get:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in __init__(self, **kwargs)
   1099 
   1100     def __init__(self, **kwargs):
-> 1101         super().__init__(fargate_scheduler=True, fargate_workers=True, **kwargs)
   1102 
   1103 

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in __init__(self, fargate_scheduler, fargate_workers, image, scheduler_cpu, scheduler_mem, scheduler_timeout, worker_cpu, worker_mem, worker_gpu, n_workers, cluster_arn, cluster_name_template, execution_role_arn, task_role_arn, task_role_policies, cloudwatch_logs_group, cloudwatch_logs_stream_prefix, cloudwatch_logs_default_retention, vpc, subnets, security_groups, environment, tags, find_address_timeout, skip_cleanup, aws_access_key_id, aws_secret_access_key, region_name, **kwargs)
    593         self._region_name = region_name
    594         self._lock = asyncio.Lock()
--> 595         super().__init__(**kwargs)
    596 
    597     async def _start(self,):

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name)
    254         if not self.asynchronous:
    255             self._loop_runner.start()
--> 256             self.sync(self._start)
    257             self.sync(self._correct_state)
    258 

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    160             return future
    161         else:
--> 162             return sync(self.loop, func, *args, **kwargs)
    163 
    164     async def _logs(self, scheduler=True, workers=True):

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    343     if error[0]:
    344         typ, exc, tb = error[0]
--> 345         raise exc.with_traceback(tb)
    346     else:
    347         return result[0]

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/utils.py in f()
    327             if callback_timeout is not None:
    328                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 329             result[0] = yield future
    330         except Exception as exc:
    331             error[0] = sys.exc_info()

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in _start(self)
    765             "Hang tight! ",
    766         ):
--> 767             await super()._start()
    768 
    769     @property

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/distributed/deploy/spec.py in _start(self)
    282 
    283         self.status = "starting"
--> 284         self.scheduler = await self.scheduler
    285         self.scheduler_comm = rpc(
    286             getattr(self.scheduler, "external_address", None) or self.scheduler.address,

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in _()
    128             async with self.lock:
    129                 if not self.task:
--> 130                     await self.start()
    131                     assert self.task
    132             return self

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in start(self)
    258             self.public_ip = interface["Association"]["PublicIp"]
    259         self.private_ip = interface["PrivateIpAddresses"][0]["PrivateIpAddress"]
--> 260         await self._set_address_from_logs()
    261         self.status = "running"
    262 

~/SageMaker/myenvs/pangeo/lib/python3.6/site-packages/dask_cloudprovider/providers/aws/ecs.py in _set_address_from_logs(self)
    181             else:
    182                 if not await self._task_is_running():
--> 183                     raise RuntimeError("%s exited unexpectedly!" % type(self).__name__)
    184                 continue
    185             break

RuntimeError: Scheduler exited unexpectedly!
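For context on where this error is raised: `_set_address_from_logs` watches the scheduler task's CloudWatch logs for the line announcing its listening address, and raises if the task stops before that line ever appears. The sketch below illustrates that pattern; it is not dask-cloudprovider's actual code, and the function name is made up, though `dask-scheduler` really does log a `Scheduler at:` line.

```python
# Sketch (not the library's exact code) of the polling check in
# _set_address_from_logs that produces "Scheduler exited unexpectedly!".

def find_scheduler_address(log_lines, task_is_running):
    """Scan log lines for the scheduler's announced address.

    Returns the address if found, None if the caller should keep polling,
    and raises RuntimeError if the task has already stopped without ever
    logging an address -- the situation in the traceback above.
    """
    for line in log_lines:
        # dask-scheduler logs e.g. "Scheduler at: tcp://10.0.1.23:8786"
        if "Scheduler at:" in line:
            return line.split("Scheduler at:")[-1].strip()
    if not task_is_running:
        raise RuntimeError("Scheduler exited unexpectedly!")
    return None  # no address yet; keep polling
```

A container that crashes on startup never logs its address, so the poll loop ends in this RuntimeError rather than in anything that points at the underlying container failure; that is why the CloudWatch logs, not the Python traceback, hold the real cause.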

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
rsignell-usgs commented, Feb 3, 2020

@jacobtomlinson , yep, the logs showed me the problem, which was a docker container error. In my description of the problem, I said I just added a few packages to a dockerfile that worked, and then it didn’t work. But when I added the packages, I added them to the wrong dockerfile. So just user error. Nothing to do with dask-cloudprovider.

0 reactions
jacobtomlinson commented, Feb 3, 2020

Sure, that sounds reasonable!
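Since the root cause turned out to be a broken container image rather than dask-cloudprovider itself, a quick local smoke test can surface this class of error before any Fargate tasks are launched. This is a hedged sketch: it assumes Docker is available locally and that the image is expected to ship the `dask-scheduler` entrypoint.

```shell
# Run the image locally, overriding the command so we only check that the
# scheduler executable exists and its imports resolve inside the image.
docker run --rm rsignell/pangeo-worker:2020-01-28 dask-scheduler --version

# If the build went to the wrong Dockerfile (as in this issue), the command
# fails here with the same container error CloudWatch would have shown.
```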

