Dask cluster errors when adaptive scaling is enabled with a custom scheduler
I was asked to repost this issue here; the original is at https://github.com/dask/dask/issues/8938
I am running into an issue when I create a Fargate cluster in a context where the scheduler is already available.
The Fargate cluster is spun up on ECS using a version of the Python script below. The instance runs continuously: I start the cluster once and leave it running.
```python
cluster = FargateCluster(
    vpc=os.getenv("VPC_ID"),
    region_name="us-west-2",
    subnets=[...],
    image=os.getenv("DASK_IMAGE"),
    n_workers=1,
    cluster_arn=os.getenv("FARGATE_CLUSTER_ARN"),
    mount_points=[...],
    volumes=[...],
    fargate_use_private_ip=True,
    scheduler_address=os.getenv("SCHEDULER_ADDRESS"),
    environment={...},
)
cluster.adapt(minimum=1, maximum=int(os.getenv("DASK_MAX_WORKERS")))
print(f"Cluster Initialized {cluster}")

while True:
    time.sleep(1)
```
What happened:
Things work well at first. After a few iterations of tasks running on the cluster, I start seeing errors in the console:
```
2022-04-14T14:26:10.151-07:00 tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb7599e22b0>>, <Task finished name='Task-784707' coro=<AdaptiveCore.__init__.<locals>._adapt() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:119> exception=TypeError("object SchedulerAddress can't be used in 'await' expression")>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 122, in _adapt
    await core.adapt()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 229, in adapt
    recommendations = await self.recommendations(target)
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 157, in recommendations
    await self.cluster
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 378, in _
    await self.scheduler
TypeError: object SchedulerAddress can't be used in 'await' expression
```
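The failing line is `await self.scheduler`: awaiting any object that is neither a coroutine nor implements `__await__` raises exactly this `TypeError`. A minimal sketch reproducing the error class (the `SchedulerAddress` here is a stand-in, not the real dask-cloudprovider class):

```python
import asyncio


class SchedulerAddress:
    """Stand-in for a plain object that is not awaitable."""


async def adapt(scheduler):
    # Mirrors `await self.scheduler` in distributed/deploy/spec.py
    await scheduler


try:
    asyncio.run(adapt(SchedulerAddress()))
except TypeError as exc:
    print(exc)  # object SchedulerAddress can't be used in 'await' expression
```

This suggests the adaptive-scaling loop eventually reaches a code path where `self.scheduler` has been replaced by (or initialized as) a bare address object instead of an awaitable.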
What you expected to happen:
Adaptive scaling should keep working. Instead, once this error appears, autoscaling stops entirely: if the cluster has 10 Dask worker nodes when the message first appears, the worker count can no longer scale up or down.
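One possible fix direction (a sketch only, not the actual dask-cloudprovider patch) is to make the scheduler placeholder awaitable, so that `await self.scheduler` becomes a harmless no-op for externally managed schedulers. The class and address below are hypothetical:

```python
import asyncio


class AwaitableSchedulerAddress:
    """Hypothetical placeholder for an externally managed scheduler.

    Implementing __await__ lets code like `await self.scheduler`
    succeed (returning the placeholder itself) instead of raising
    TypeError on a plain object.
    """

    def __init__(self, address):
        self.address = address

    def __await__(self):
        async def _noop():
            return self

        return _noop().__await__()


async def main():
    scheduler = AwaitableSchedulerAddress("tcp://10.0.0.1:8786")
    result = await scheduler  # no TypeError
    print(result.address)


asyncio.run(main())
```

This mirrors how SpecCluster expects its `scheduler` attribute to be awaitable during adaptive scaling; the real fix would need to live in the library rather than user code.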
Anything else we need to know?:
All the images are based on the Docker image daskdev/dask:2022.1.0.
Environment:
- Dask version: daskdev/dask:2022.1.0
- Python version: 3.8
- Operating System: Ubuntu
- Install method (conda, pip, source): pip
Top GitHub Comments
Looks like a bug in https://github.com/dask/dask-cloudprovider?
@jkanche could you open a new issue with this error?