Kubernetes: disable Dask scheduler and worker auto-rescheduling
Hello,
The issue is more related to Kubernetes and GCP, but I would still like some advice. I have created a dynamic Dask Kubernetes cluster (using dask-kubernetes) on GCP and set up node autoscaling. The initial state of the Dask cluster is one scheduler pod (with KubeCluster) and one worker pod (created by the scheduler). Everything works well, but when the scheduler starts to add new workers (due to high load) and GCP begins to scale up nodes, quite often the scheduler pod gets rescheduled: Kubernetes or GCP decides to delete the scheduler and recreate it on another node. Because of that, all tasks are lost, I receive an error, and the cluster becomes unstable. Have you ever experienced such behavior?
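If the evictions are coming from the GKE cluster autoscaler (for example when it later rebalances or scales nodes back down), one documented way to keep a specific pod in place is the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation. Below is only a minimal sketch of a scheduler Deployment carrying that annotation; all names, labels, and the image are illustrative, not the actual setup from this issue:

```yaml
# Sketch only: a scheduler Deployment whose pod template asks the
# cluster autoscaler not to evict it. Names/labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dask-scheduler
  template:
    metadata:
      labels:
        app: dask-scheduler
      annotations:
        # Tells the cluster autoscaler this pod must not be moved
        # when it consolidates or scales down nodes.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      nodeSelector:
        # Keep the scheduler on a stable (non-preemptible) node pool.
        cloud.google.com/gke-nodepool: default-pool
      containers:
        - name: scheduler
          image: daskdev/dask:latest   # placeholder; the issue uses a custom image
          args: ["dask-scheduler"]
```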
To tackle this problem, I added a nodeSelector to the scheduler pod definition, and it seems to work (at least it looks that way). But the same situation also appears for the workers: they are deleted and recreated, and you lose your results. In this situation you cannot easily set up nodeSelector labels for dynamically created workers.
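Since the workers are created by the scheduler from a pod template, the same annotation and a nodeSelector can in principle go into the worker spec that KubeCluster is given. A sketch, assuming the classic worker-spec.yml style; the image, node-pool label, and resource values are placeholders:

```yaml
# worker-spec.yml -- sketch of a worker pod template for dask-kubernetes.
kind: Pod
metadata:
  annotations:
    # Ask the cluster autoscaler not to evict running workers.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  restartPolicy: Never
  nodeSelector:
    # Illustrative label: pin workers to a dedicated node pool.
    cloud.google.com/gke-nodepool: dask-workers
  containers:
    - name: dask-worker
      image: daskdev/dask:latest   # placeholder image
      args: [dask-worker, --nthreads, "2", --memory-limit, 6GB, --death-timeout, "60"]
      resources:
        requests:
          cpu: "2"
          memory: 6G
        limits:
          cpu: "2"
          memory: 6G
```

In the classic dask-kubernetes API this would be loaded with something like `KubeCluster.from_yaml('worker-spec.yml')`.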
It would be really great to have a feature (property) in the pod definition that says: “Don’t delete/move this pod until it fails or succeeds.” Does it make sense to add such functionality to the Kubernetes project?
Or maybe you have ideas for a different solution?
Thank you for your attention.
Issue Analytics
- Created 5 years ago
- Comments: 18 (7 by maintainers)
Top GitHub Comments
As this is still ticking along and is not actually an issue with dask-kubernetes, but rather a Dask-on-Kubernetes use case, I’m going to close this again in favor of the Stack Overflow question.

The scheduler pod is based on dask-kubernetes and my own image (link). When I tested maxUnavailable last time, I checked all the labels a few times, and my pod is part of a Deployment. Maybe I missed something. I will try to test it one more time today and comment the results here. I’m also surprised that maxUnavailable is not working as expected. Thanks for reopening this issue.