Dask cluster errors when adaptive scaling is enabled with a custom scheduler
I was asked to repost this issue here; the original is at https://github.com/dask/dask/issues/8938
I am running into an issue when I create a Fargate cluster in a context where the scheduler is already available.
The Fargate cluster is spun up on ECS using a version of the Python script below. The instance runs continuously: I start the cluster once and leave it running.
```python
cluster = FargateCluster(
    vpc=os.getenv("VPC_ID"),
    region_name="us-west-2",
    subnets=[...],
    image=os.getenv("DASK_IMAGE"),
    n_workers=1,
    cluster_arn=os.getenv("FARGATE_CLUSTER_ARN"),
    mount_points=[...],
    volumes=[...],
    fargate_use_private_ip=True,
    scheduler_address=os.getenv("SCHEDULER_ADDRESS"),
    environment={...},
)
cluster.adapt(minimum=1, maximum=int(os.getenv("DASK_MAX_WORKERS")))
print(f"Cluster Initialized {cluster}")

while True:
    time.sleep(1)
```
What happened:
Things work well at first. After a few iterations of tasks running on the cluster, I start seeing errors in the console:
```
2022-04-14T14:26:10.151-07:00 tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb7599e22b0>>, <Task finished name='Task-784707' coro=<AdaptiveCore.__init__.<locals>._adapt() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:119> exception=TypeError("object SchedulerAddress can't be used in 'await' expression")>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 122, in _adapt
    await core.adapt()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 229, in adapt
    recommendations = await self.recommendations(target)
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 157, in recommendations
    await self.cluster
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 378, in _
    await self.scheduler
TypeError: object SchedulerAddress can't be used in 'await' expression
```
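The failing line is `await self.scheduler`: awaiting any object that is neither a coroutine nor implements `__await__` raises exactly this `TypeError`. A minimal sketch reproducing the error class (the `SchedulerAddress` here is a stand-in, not the real dask-cloudprovider class):

```python
import asyncio


class SchedulerAddress:
    """Stand-in for a plain object that is not awaitable."""


async def adapt(scheduler):
    # Mirrors `await self.scheduler` in distributed/deploy/spec.py
    await scheduler


try:
    asyncio.run(adapt(SchedulerAddress()))
except TypeError as exc:
    print(exc)  # object SchedulerAddress can't be used in 'await' expression
```

This suggests the adaptive-scaling loop eventually reaches a code path where `self.scheduler` has been replaced by (or initialized as) a bare address object instead of an awaitable.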
What you expected to happen:
Adaptive scaling should keep working. Instead, once this error appears, autoscaling stops entirely: if the cluster has 10 Dask worker nodes when the message first appears, the worker count can no longer scale up or down.
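One possible fix direction (a sketch only, not the actual dask-cloudprovider patch) is to make the scheduler placeholder awaitable, so that `await self.scheduler` becomes a harmless no-op for externally managed schedulers. The class and address below are hypothetical:

```python
import asyncio


class AwaitableSchedulerAddress:
    """Hypothetical placeholder for an externally managed scheduler.

    Implementing __await__ lets code like `await self.scheduler`
    succeed (returning the placeholder itself) instead of raising
    TypeError on a plain object.
    """

    def __init__(self, address):
        self.address = address

    def __await__(self):
        async def _noop():
            return self

        return _noop().__await__()


async def main():
    scheduler = AwaitableSchedulerAddress("tcp://10.0.0.1:8786")
    result = await scheduler  # no TypeError
    print(result.address)


asyncio.run(main())
```

This mirrors how SpecCluster expects its `scheduler` attribute to be awaitable during adaptive scaling; the real fix would need to live in the library rather than user code.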
Anything else we need to know?:
All the images are based on the Docker image daskdev/dask:2022.1.0.
Environment:
- Dask version: daskdev/dask:2022.1.0
- Python version: 3.8
- Operating System: Ubuntu
- Install method (conda, pip, source): pip
Top GitHub Comments
Looks like a bug in https://github.com/dask/dask-cloudprovider?
@jkanche could you open a new issue with this error?