Calling "cluster.adapt()" multiple times causes strange behavior and `KilledWorker` errors

What happened:

We tried scaling an adaptive cluster twice like this:

cluster.adapt(maximum=16)
cluster.adapt(maximum=8)

After submitting tasks to the cluster, workers crash with:

KilledWorker: ('sample_pi_monte_carlo-c4b060d1-c020-4aa1-b1ea-9f7fadeb881a', <Worker 'tcp://10.56.4.77:40643', name: 7, memory: 0, processing: 94>)

and the scheduler logs show that it scales the cluster up and then immediately retires the workers:

distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Retire worker names (1, 2, 3, 4, 5, 6, 7, 8)
distributed.deploy.adaptive - INFO - Retiring workers [1, 2, 3, 4, 5, 6, 7, 8]
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.56.11.43:33153', name: 12, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.56.11.43:33153
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.56.6.29:35277', name: 6, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.56.6.29:35277
[more register worker / starting worker messages]
...
[more remove worker / removing comms messages]
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.56.6.29:35277', name: 6, memory: 14, processing: 85>
distributed.core - INFO - Removing comms to tcp://10.56.6.29:35277
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.56.4.77:40643', name: 7, memory: 14, processing: 94>
distributed.core - INFO - Removing comms to tcp://10.56.4.77:40643

What you expected to happen:

Calling cluster.adapt() multiple times should have no effect if the cluster size is within bounds, since you’re only changing the minimum and maximum number of workers.
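
A workaround sketch, not from the original report and untested on my side: configure adaptive scaling once with the final bounds instead of layering adapt() calls. Here pod_spec is a hypothetical stand-in for the worker pod specification that is elided in the example below.

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# pod_spec is a placeholder; any valid KubeCluster pod specification works here.
cluster = KubeCluster.from_dict(pod_spec)
client = Client(cluster)

# A single adapt() call with the intended bounds, rather than a wide call
# followed by a tighter one.
cluster.adapt(minimum=0, maximum=8)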

Minimal Complete Verifiable Example:

import time

import dask
import numpy as np
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_dict(...)  # worker pod spec elided here
client = Client(cluster)

# Configure adaptive scaling, then tighten the bounds with a second call.
cluster.adapt(maximum=16)
cluster.adapt(maximum=8)

def sample_pi_monte_carlo(sleep_time):
    # One Monte Carlo sample: draw a point in the unit square and report
    # whether it falls inside the unit circle.
    x = np.random.uniform(-1, 1)
    y = np.random.uniform(-1, 1)
    time.sleep(sleep_time)
    return np.sqrt(x**2 + y**2) < 1

# Submit 1000 delayed samples and block until the results come back.
results = client.compute([
    dask.delayed(sample_pi_monte_carlo)(0.1)
    for _ in range(1000)
], sync=True)

print("pi is:", 4 * np.mean(results))

Anything else we need to know?:

This uses dask_kubernetes to implement adaptive scaling. I originally reported this at https://github.com/dask/dask-kubernetes/issues/250, but it might be an upstream issue (their KubeCluster class inherits from SpecCluster and doesn’t add any logic to .adapt()).
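
To check whether it really is upstream, here is a sketch of the same pattern against a LocalCluster, which (as far as I know) is also built on SpecCluster, so Kubernetes is taken out of the picture entirely. This is my guess at a minimal local reproduction rather than something from the original report:

import time

import dask
import numpy as np
from dask.distributed import Client, LocalCluster

# Start with no workers so that adaptive scaling is the only thing adding them.
cluster = LocalCluster(n_workers=0)
client = Client(cluster)

# Same double-adapt pattern as with KubeCluster above.
cluster.adapt(maximum=16)
cluster.adapt(maximum=8)

def sample_pi_monte_carlo(sleep_time):
    x = np.random.uniform(-1, 1)
    y = np.random.uniform(-1, 1)
    time.sleep(sleep_time)
    return np.sqrt(x**2 + y**2) < 1

results = client.compute([
    dask.delayed(sample_pi_monte_carlo)(0.1)
    for _ in range(1000)
], sync=True)

print("pi is:", 4 * np.mean(results))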

Environment:

  • Dask version: 2.15.0
  • Python version: 3.7
  • Operating System: Ubuntu 18
  • Install method (conda, pip, source): pip
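
A quick way to confirm the versions above from the client side; client.get_versions() can additionally compare them against the scheduler and workers:

import dask
import distributed

# Versions installed on the machine running the client.
print("dask:", dask.__version__)
print("distributed:", distributed.__version__)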

Let me know if there’s anything else I can do to help!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
somewacko commented, Jul 2, 2020

Sorry for the slow response, but thanks for looking into this!! The local example seems to have been fixed for me too.

0 reactions
jacobtomlinson commented, Jun 22, 2020

Hopefully resolved in #3915

