Calling "cluster.adapt()" multiple times causes strange behavior and `KilledWorker` errors

What happened:

We tried scaling an adaptive cluster twice like this:

cluster.adapt(maximum=16)
cluster.adapt(maximum=8)

After submitting tasks to the cluster, workers crash with:

KilledWorker: ('sample_pi_monte_carlo-c4b060d1-c020-4aa1-b1ea-9f7fadeb881a', <Worker 'tcp://10.56.4.77:40643', name: 7, memory: 0, processing: 94>)

and the scheduler logs show that it scales the cluster up and then immediately retires the workers:

distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Retire worker names (1, 2, 3, 4, 5, 6, 7, 8)
distributed.deploy.adaptive - INFO - Retiring workers [1, 2, 3, 4, 5, 6, 7, 8]
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.56.11.43:33153', name: 12, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.56.11.43:33153
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.56.6.29:35277', name: 6, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.56.6.29:35277
[more register worker / starting worker messages]
...
[more remove worker / removing comms messages]
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.56.6.29:35277', name: 6, memory: 14, processing: 85>
distributed.core - INFO - Removing comms to tcp://10.56.6.29:35277
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.56.4.77:40643', name: 7, memory: 14, processing: 94>
distributed.core - INFO - Removing comms to tcp://10.56.4.77:40643

What you expected to happen:

Calling cluster.adapt() multiple times should have no effect if the cluster size is within bounds, since you’re only changing the minimum and maximum number of workers.
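
A workaround sketch, not from the original report and untested on my side: configure adaptive scaling once with the final bounds instead of layering adapt() calls. Here pod_spec is a hypothetical stand-in for the worker pod specification that is elided in the example below.

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# pod_spec is a placeholder; any valid KubeCluster pod specification works here.
cluster = KubeCluster.from_dict(pod_spec)
client = Client(cluster)

# A single adapt() call with the intended bounds, rather than a wide call
# followed by a tighter one.
cluster.adapt(minimum=0, maximum=8)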

Minimal Complete Verifiable Example:

import time

import dask
import numpy as np
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_dict(...)  # worker pod spec elided here
client = Client(cluster)

# Configure adaptive scaling, then tighten the bounds with a second call.
cluster.adapt(maximum=16)
cluster.adapt(maximum=8)

def sample_pi_monte_carlo(sleep_time):
    # One Monte Carlo sample: draw a point in the unit square and report
    # whether it falls inside the unit circle.
    x = np.random.uniform(-1, 1)
    y = np.random.uniform(-1, 1)
    time.sleep(sleep_time)
    return np.sqrt(x**2 + y**2) < 1

# Submit 1000 delayed samples and block until the results come back.
results = client.compute([
    dask.delayed(sample_pi_monte_carlo)(0.1)
    for _ in range(1000)
], sync=True)

print("pi is:", 4 * np.mean(results))

Anything else we need to know?:

This uses dask_kubernetes to implement adaptive scaling. I originally reported this at https://github.com/dask/dask-kubernetes/issues/250, but it might be an upstream issue (their KubeCluster class inherits from SpecCluster and doesn’t add any logic to .adapt()).
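
To check whether it really is upstream, here is a sketch of the same pattern against a LocalCluster, which (as far as I know) is also built on SpecCluster, so Kubernetes is taken out of the picture entirely. This is my guess at a minimal local reproduction rather than something from the original report:

import time

import dask
import numpy as np
from dask.distributed import Client, LocalCluster

# Start with no workers so that adaptive scaling is the only thing adding them.
cluster = LocalCluster(n_workers=0)
client = Client(cluster)

# Same double-adapt pattern as with KubeCluster above.
cluster.adapt(maximum=16)
cluster.adapt(maximum=8)

def sample_pi_monte_carlo(sleep_time):
    x = np.random.uniform(-1, 1)
    y = np.random.uniform(-1, 1)
    time.sleep(sleep_time)
    return np.sqrt(x**2 + y**2) < 1

results = client.compute([
    dask.delayed(sample_pi_monte_carlo)(0.1)
    for _ in range(1000)
], sync=True)

print("pi is:", 4 * np.mean(results))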

Environment:

  • Dask version: 2.15.0
  • Python version: 3.7
  • Operating System: Ubuntu 18
  • Install method (conda, pip, source): pip
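
A quick way to confirm the versions above from the client side; client.get_versions() can additionally compare them against the scheduler and workers:

import dask
import distributed

# Versions installed on the machine running the client.
print("dask:", dask.__version__)
print("distributed:", distributed.__version__)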

Let me know if there’s anything else I can do to help!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
somewacko commented, Jul 2, 2020

Sorry for the slow response, but thanks for looking into this!! The local example seems to have been fixed for me too.

0 reactions
jacobtomlinson commented, Jun 22, 2020

Hopefully resolved in #3915

