Kubernetes, disable dask scheduler and workers auto rescheduling

Hello,

This issue is more about Kubernetes and GCP than Dask, but I would still like some advice. I have created a dynamic Dask k8s cluster (using dask-kubernetes) on GCP and set up node autoscaling. The initial state of the Dask cluster is one scheduler pod (with KubeCluster) and one worker pod (created by the scheduler). Everything works well, but when the scheduler starts adding workers (due to high load) and GCP begins to scale up nodes, the scheduler pod is quite often rescheduled: Kubernetes or GCP decides to delete the scheduler and recreate it on another node. As a result all tasks are lost, I receive an error, and the cluster becomes unstable. Have you ever experienced such behavior?

To tackle this problem, I added a nodeSelector to the scheduler pod definition, and it works well (at least it looks that way). However, the same thing happens to the workers: they are deleted and recreated, and you lose your results. In this situation you cannot easily set nodeSelector labels, because the workers are created dynamically.
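For readers following along, here is a minimal sketch of the nodeSelector approach in Python, using plain dictionaries instead of any Kubernetes client. The pod name, image, and the `app: dask-scheduler` label are illustrative assumptions (the label only works if a node pool actually carries it), and `with_node_selector` is a hypothetical helper, not part of dask-kubernetes.

```python
import copy
import json


def with_node_selector(pod, labels):
    """Return a copy of a pod manifest (plain dict) pinned to nodes carrying `labels`."""
    pinned = copy.deepcopy(pod)
    # Create spec/nodeSelector if missing, then merge in the requested labels.
    pinned.setdefault("spec", {}).setdefault("nodeSelector", {}).update(labels)
    return pinned


# Hypothetical minimal scheduler pod; name and image are placeholders.
scheduler_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dask-scheduler", "labels": {"app": "dask-scheduler"}},
    "spec": {
        "containers": [
            {"name": "scheduler", "image": "daskdev/dask:latest", "args": ["dask-scheduler"]}
        ]
    },
}

# Assumes a node pool labelled app=dask-scheduler exists in the cluster.
pinned_scheduler = with_node_selector(scheduler_pod, {"app": "dask-scheduler"})
print(json.dumps(pinned_scheduler, indent=2))
```

The helper copies the manifest rather than mutating it, so the same base definition can still be reused for pods that should not be pinned.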

It would be really great to have a feature (a property) in the pod definition that says: “Don’t delete/move this pod until it fails or succeeds”. Does it make sense to add such functionality to the k8s project?
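One existing mechanism that comes close to this request is the Cluster Autoscaler annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`, which asks the autoscaler not to evict the pod when it removes nodes. The sketch below assumes the rescheduling is caused by the autoscaler's scale-down; the annotation does not protect against other causes such as preemption or node upgrades. The worker manifest and the `mark_not_evictable` helper are hypothetical.

```python
import copy

# Annotation read by the Kubernetes Cluster Autoscaler during scale-down.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def mark_not_evictable(pod):
    """Return a copy of a pod manifest annotated so the autoscaler leaves it in place."""
    marked = copy.deepcopy(pod)
    annotations = marked.setdefault("metadata", {}).setdefault("annotations", {})
    # Annotation values must be strings, hence "false" rather than False.
    annotations[SAFE_TO_EVICT] = "false"
    return marked


# Hypothetical worker pod template; name and image are placeholders.
worker_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"labels": {"app": "dask-worker"}},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {"name": "worker", "image": "daskdev/dask:latest", "args": ["dask-worker"]}
        ],
    },
}

protected_worker = mark_not_evictable(worker_pod)
print(protected_worker["metadata"]["annotations"])
```

Because the annotation lives in the pod template, it applies to every dynamically created worker without needing per-node labels.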

Maybe you have ideas for a different solution?

Thank you for your attention.

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Comments:18 (7 by maintainers)

Top GitHub Comments

1 reaction
jacobtomlinson commented, Dec 17, 2018

As this is still ticking along and is not actually an issue with dask-kubernetes but rather a dask-kubernetes use case, I’m going to close this again in favor of the Stack Overflow question.

1 reaction
VMois commented, Nov 27, 2018

The scheduler pod is based on dask-kubernetes and my own image (link). When I last tested maxUnavailable, I checked all the labels several times, and my pod is part of a Deployment. Maybe I missed something. I will test it once more today and post the results here. I’m also surprised that maxUnavailable is not working as expected.

Thanks for reopening this issue.
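For context, `maxUnavailable` here most likely refers to a PodDisruptionBudget, which only protects pods whose labels match its selector. The sketch below builds such a manifest as a plain Python dictionary and checks the label match locally. The object name and labels are assumptions, `pod_disruption_budget` and `selects` are hypothetical helpers, and `policy/v1` is the current API version (clusters from 2018 used `policy/v1beta1`).

```python
def pod_disruption_budget(name, match_labels, max_unavailable=0):
    """Build a PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # 0 means no selected pod may be taken down by a voluntary disruption.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


def selects(pdb, pod):
    """Check locally whether every matchLabels pair appears in the pod's labels."""
    wanted = pdb["spec"]["selector"]["matchLabels"]
    have = pod.get("metadata", {}).get("labels", {})
    return all(have.get(key) == value for key, value in wanted.items())


# Hypothetical names and labels, for illustration only.
scheduler_pdb = pod_disruption_budget("dask-scheduler-pdb", {"app": "dask-scheduler"})
scheduler = {"metadata": {"name": "dask-scheduler", "labels": {"app": "dask-scheduler"}}}
worker = {"metadata": {"name": "dask-worker-1", "labels": {"app": "dask-worker"}}}

print(selects(scheduler_pdb, scheduler))
print(selects(scheduler_pdb, worker))
```

A budget like this limits voluntary evictions only, so it would not explain or prevent a pod lost to a node failure.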

Top Results From Across the Web

Disable auto rescheduling for a pod - kubernetes
This will create the nodepool with the label app=dask-scheduler, after in the pod spec, you can do this: nodeSelector: app: dask-scheduler.

Setup adaptive deployments - Dask documentation
Adaptively allocate workers based on scheduler load. A superclass. Contains logic to dynamically resize a Dask cluster based on current use. This class...

API — Dask.distributed 2022.12.1 documentation
warn() function for issuing warnings remotely from workers to clients. Reschedule. Reschedule this task. ReplayTaskClient.recreate_task_locally ...

Source code for distributed.scheduler - Dask documentation
If " "your deployment system does not automatically re-launch terminated " "processes, then those workers will never come back, and `Client.restart` " "will ......

Source code for distributed.client - Dask documentation
It is also common to create a Client without specifying the scheduler address ... to automatically check) Whether or not to connect directly...