Kubernetes: disable Dask scheduler and worker auto-rescheduling
Hello,
The issue is more related to Kubernetes and GCP, but I would still like some advice. I have created a dynamic Dask Kubernetes cluster (using dask-kubernetes) on GCP and set up node autoscaling. The initial state of the Dask cluster is one scheduler pod (with KubeCluster) and one worker pod (created by the scheduler). Everything works well, but when the scheduler starts to add new workers (due to high load) and GCP begins to scale up nodes, quite often the scheduler pod gets rescheduled: Kubernetes or GCP decides to delete the scheduler and recreate it on another node. Because of that, all tasks are lost, I receive an error, and the cluster becomes unstable. Have you ever experienced such behavior?
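If the evictions are coming from the GKE cluster autoscaler (for example when it later rebalances or scales nodes back down), one documented way to keep a specific pod in place is the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation. Below is only a minimal sketch of a scheduler Deployment carrying that annotation; all names, labels, and the image are illustrative, not the actual setup from this issue:

```yaml
# Sketch only: a scheduler Deployment whose pod template asks the
# cluster autoscaler not to evict it. Names/labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dask-scheduler
  template:
    metadata:
      labels:
        app: dask-scheduler
      annotations:
        # Tells the cluster autoscaler this pod must not be moved
        # when it consolidates or scales down nodes.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      nodeSelector:
        # Keep the scheduler on a stable (non-preemptible) node pool.
        cloud.google.com/gke-nodepool: default-pool
      containers:
        - name: scheduler
          image: daskdev/dask:latest   # placeholder; the issue uses a custom image
          args: ["dask-scheduler"]
```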
To tackle this problem, I added a nodeSelector to the scheduler pod definition, and it seems to work (at least it looks that way). But the same situation also appears for the workers: they are deleted and recreated, and you lose your results. In this situation you cannot easily set up nodeSelector labels for dynamically created workers.
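Since the workers are created by the scheduler from a pod template, the same annotation and a nodeSelector can in principle go into the worker spec that KubeCluster is given. A sketch, assuming the classic worker-spec.yml style; the image, node-pool label, and resource values are placeholders:

```yaml
# worker-spec.yml -- sketch of a worker pod template for dask-kubernetes.
kind: Pod
metadata:
  annotations:
    # Ask the cluster autoscaler not to evict running workers.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  restartPolicy: Never
  nodeSelector:
    # Illustrative label: pin workers to a dedicated node pool.
    cloud.google.com/gke-nodepool: dask-workers
  containers:
    - name: dask-worker
      image: daskdev/dask:latest   # placeholder image
      args: [dask-worker, --nthreads, "2", --memory-limit, 6GB, --death-timeout, "60"]
      resources:
        requests:
          cpu: "2"
          memory: 6G
        limits:
          cpu: "2"
          memory: 6G
```

In the classic dask-kubernetes API this would be loaded with something like `KubeCluster.from_yaml('worker-spec.yml')`.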
It would be really great to have a feature (property) in the pod definition that says: “Don’t delete/move this pod until it fails or succeeds.” Does it make sense to add such functionality to the Kubernetes project?
Or maybe you have ideas for a different solution?
Thank you for your attention.
Issue Analytics
- Created 5 years ago
- Comments: 18 (7 by maintainers)
Top GitHub Comments
As this is still ticking along and is not actually an issue with dask-kubernetes, but rather a Dask-on-Kubernetes use case, I’m going to close this again in favor of the Stack Overflow question.

The scheduler pod is based on dask-kubernetes and my own image (link). When I tested maxUnavailable last time, I checked all the labels a few times, and my pod is part of a Deployment. Maybe I missed something. I will try to test it one more time today and comment the results here. I’m also surprised that maxUnavailable is not working as expected. Thanks for reopening this issue.