dask cluster errors when adaptive scaling is enabled with a custom scheduler

I was asked to repost this here - https://github.com/dask/dask/issues/8938

  • Dask version: daskdev/dask:2022.1.0
  • Python version: 3.8
  • Operating System: Ubuntu
  • Install method (conda, pip, source): pip

I am running into an issue when I create a Fargate cluster in a context where the scheduler is already available.

The Fargate cluster is spun up on ECS using a version of the Python script below. The instance runs all the time, so I start the cluster once and leave it running.

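import os
import time

from dask_cloudprovider.aws import FargateCluster

# Reuse the scheduler that is already running (SCHEDULER_ADDRESS).
cluster = FargateCluster(
    vpc=os.getenv("VPC_ID"),
    region_name="us-west-2",
    subnets=[...],
    image=os.getenv("DASK_IMAGE"),
    n_workers=1,
    cluster_arn=os.getenv("FARGATE_CLUSTER_ARN"),
    mount_points=[...],
    volumes=[...],
    fargate_use_private_ip=True,
    scheduler_address=os.getenv("SCHEDULER_ADDRESS"),
    environment={...},
)

# Scale between 1 and DASK_MAX_WORKERS workers.
cluster.adapt(minimum=1, maximum=int(os.getenv("DASK_MAX_WORKERS")))

print(f"Cluster Initialized {cluster}")

# Keep the process alive so the cluster stays up.
while True:
    time.sleep(1)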
What happened:

Things work well at first. After a few iterations of tasks running on the cluster, I see errors like this in the console:

2022-04-14T14:26:10.151-07:00 | tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb7599e22b0>>, <Task finished name='Task-784707' coro=<AdaptiveCore.__init__.<locals>._adapt() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py:119> exception=TypeError("object SchedulerAddress can't be used in 'await' expression")>)
2022-04-14T14:26:10.151-07:00 | Traceback (most recent call last):
2022-04-14T14:26:10.151-07:00 |   File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
2022-04-14T14:26:10.151-07:00 |     ret = callback()
2022-04-14T14:26:10.151-07:00 |   File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
2022-04-14T14:26:10.151-07:00 |     future.result()
2022-04-14T14:26:10.151-07:00 |   File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 122, in _adapt
2022-04-14T14:26:10.151-07:00 |     await core.adapt()
2022-04-14T14:26:10.151-07:00 |   File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive_core.py", line 229, in adapt
2022-04-14T14:26:10.151-07:00 |     recommendations = await self.recommendations(target)
2022-04-14T14:26:10.151-07:00 |   File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/adaptive.py", line 157, in recommendations
2022-04-14T14:26:10.151-07:00 |     await self.cluster
2022-04-14T14:26:10.151-07:00 |   File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 378, in _
2022-04-14T14:26:10.151-07:00 |     await self.scheduler
2022-04-14T14:26:10.151-07:00 | TypeError: object SchedulerAddress can't be used in 'await' expression
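
The last frame of the traceback awaits `self.scheduler`, and Python raises this kind of `TypeError` whenever `await` is applied to an object that does not define `__await__`. The sketch below reproduces the failure in isolation, assuming nothing about the real library code. The two classes are hypothetical stand-ins that only borrow the name from the error message, and adding `__await__` is shown to illustrate the mechanism, not as the fix that was applied upstream.

```python
import asyncio


class SchedulerAddress:
    """Hypothetical stand-in: stores an address, defines no __await__."""

    def __init__(self, address):
        self.address = address


class AwaitableSchedulerAddress(SchedulerAddress):
    """The same stand-in, made awaitable by defining __await__."""

    def __await__(self):
        async def _ready():
            # Nothing to wait for in this sketch; resolve to the object itself.
            return self

        return _ready().__await__()


async def wait_for(scheduler):
    """Mimic the failing step: await whatever is stored as the scheduler."""
    try:
        await scheduler
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


if __name__ == "__main__":
    # The address below is a made-up placeholder.
    print(asyncio.run(wait_for(SchedulerAddress("tcp://10.0.0.1:8786"))))
    print(asyncio.run(wait_for(AwaitableSchedulerAddress("tcp://10.0.0.1:8786"))))
```

The first call reports a `TypeError` (the exact wording depends on the Python version) and the second completes normally, which suggests the adaptive loop fails because the object stored as the scheduler is a plain value rather than an awaitable.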

What you expected to happen:

Adaptive scaling should keep working. Instead, once the error message appears, autoscaling stops: if the cluster has 10 Dask workers when the message first appears, it can no longer add or remove workers.

Anything else we need to know?:

All the images are based on the Docker image daskdev/dask:2022.1.0.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

graingert commented, Apr 20, 2022

jacobtomlinson commented, Jul 7, 2022

@jkanche could you open a new issue with this error?
