Calling "cluster.adapt()" multiple times causes strange behavior and `KilledWorker` errors
What happened:
We tried scaling an adaptive cluster twice, like this:

```python
cluster.adapt(maximum=16)
cluster.adapt(maximum=8)
```
After submitting tasks to the cluster, workers crash with

```
KilledWorker: ('sample_pi_monte_carlo-c4b060d1-c020-4aa1-b1ea-9f7fadeb881a', <Worker 'tcp://10.56.4.77:40643', name: 7, memory: 0, processing: 94>)
```

and the scheduler logs show that it scales the cluster up and then immediately retires the workers:
```
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Retire worker names (1, 2, 3, 4, 5, 6, 7, 8)
distributed.deploy.adaptive - INFO - Retiring workers [1, 2, 3, 4, 5, 6, 7, 8]
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.56.11.43:33153', name: 12, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.56.11.43:33153
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.56.6.29:35277', name: 6, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.56.6.29:35277
[more register worker / starting worker messages]
...
[more remove worker / removing comms messages]
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.56.6.29:35277', name: 6, memory: 14, processing: 85>
distributed.core - INFO - Removing comms to tcp://10.56.6.29:35277
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.56.4.77:40643', name: 7, memory: 14, processing: 94>
distributed.core - INFO - Removing comms to tcp://10.56.4.77:40643
```
What you expected to happen:
Calling `cluster.adapt()` multiple times should have no effect as long as the cluster size stays within the bounds, since you're only changing the minimum and maximum number of workers.
Minimal Complete Verifiable Example:
```python
import time

import dask
import numpy as np
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_dict(...)  # worker pod spec omitted
client = Client(cluster)

cluster.adapt(maximum=16)
cluster.adapt(maximum=8)


def sample_pi_monte_carlo(sleep_time):
    # Draw one point in the unit square and report whether it lands in the unit circle.
    x = np.random.uniform(-1, 1)
    y = np.random.uniform(-1, 1)
    time.sleep(sleep_time)
    return np.sqrt(x**2 + y**2) < 1


results = client.compute(
    [dask.delayed(sample_pi_monte_carlo)(0.1) for _ in range(1000)],
    sync=True,
)
print("pi is:", 4 * np.mean(results))
```
Anything else we need to know?:
This uses `dask_kubernetes` to implement adaptive scaling. I originally reported this at https://github.com/dask/dask-kubernetes/issues/250, but it might be an upstream issue (their `KubeCluster` class inherits from `SpecCluster` and doesn't add any logic on top of `.adapt()`).
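As a stopgap, assuming the problem is triggered specifically by the second `adapt()` call, one workaround is to decide on the final bounds up front and call `adapt()` only once (the `minimum=0` below is illustrative, not from the original report):

```python
# Possible workaround (an assumption, not a confirmed fix): call adapt() a
# single time with the bounds you actually want, instead of tightening them
# in a second call.
cluster = KubeCluster.from_dict(...)  # same worker spec as in the example above
client = Client(cluster)
cluster.adapt(minimum=0, maximum=8)   # one call with the final bounds
```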
Environment:
- Dask version: 2.15.0
- Python version: 3.7
- Operating System: Ubuntu 18
- Install method (conda, pip, source): pip
Let me know if there’s anything else I can do to help!
Top GitHub Comments
Sorry for the slow response, but thanks for looking into this!! The local example seems to have been fixed for me too.
Hopefully resolved in #3915