Support adaptive scaling in Helm cluster

As I understand it, the difference between KubeCluster and HelmCluster is:

  1. KubeCluster starts the scheduler where the client runs; the worker resources come from Kubernetes.
  2. HelmCluster has a long-running scheduler pod in the Kubernetes cluster.

My requirement is: I would like a long-running scheduler in the cluster that multiple clients can connect to in order to submit tasks, with worker resources coming from the same Kubernetes cluster as the scheduler and scaling up and down based on load, like what KubeCluster provides.

It seems like a combination of KubeCluster and HelmCluster. Did the community consider this case when Kubernetes support was added? Are there any technical blockers? If this is something reasonable, I can help work on this feature request.
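
To make the requirement concrete, here is a minimal sketch of the two halves; the scheduler address is a placeholder that depends on the Helm release and namespace, and the adaptive part is what KubeCluster's adapt() already provides and what this request asks HelmCluster to support:

from dask.distributed import Client

# Any number of clients can already share the long-running Helm-deployed scheduler
# by connecting to its service address (the address below is a placeholder).
client = Client("tcp://dask-scheduler:8786")
print(client.submit(sum, [1, 2, 3]).result())

# The missing half: scaling the Helm-managed workers up and down with the load,
# the way cluster.adapt(minimum=..., maximum=...) already does on KubeCluster.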

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
jacobtomlinson commented, Jan 8, 2021

Sure, I understand that. This is probably the kind of question that would be asked on a forum, which we have discussed creating in the past.

> Also, you should let people sponsor you on Github!

That’s very kind of you, but I’ll leave my employer to do the sponsoring 😄.

If you want to give back to the Dask community, you can donate via NumFOCUS.

1 reaction
omarsumadi commented, Jan 7, 2021

Hi @jacobtomlinson - I wanted to piggyback on this exact question to perhaps add some clarity for people who are looking at Dask as a small-business solution for scheduling workflows. By the way, thanks for everything you have done - we need more people like you.

I am at a crossroads over how my small business should deploy Dask as a way for our projected ~10 analysts to execute long-running Python computations. Here's the workflow I run:

  • Someone submits their code through our Admin interface
  • That code is sent to our Django Webserver pod running inside of Kubernetes
  • The code is processed by either threads or processes, depending on what the user specifies and on whether the GIL is released (as in a Dask DataFrame operation); see the sketch after this list
  • The number of workers is known beforehand (our analysts have to specify how many processes/threads they want)
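
A minimal sketch of how that thread-versus-process choice could map onto worker pod specs, assuming make_pod_spec's threads_per_worker argument; the sizes and counts are made-up placeholders:

from dask_kubernetes import KubeCluster, make_pod_spec

# GIL-releasing work (e.g. dask.dataframe): fewer, larger workers with several threads each.
threaded_spec = make_pod_spec(image='daskdev/dask:latest',
                              threads_per_worker=4,
                              memory_limit='8G', memory_request='8G',
                              cpu_limit=4, cpu_request=4)

# GIL-bound work: many small single-threaded workers, i.e. parallelism via processes/pods.
process_spec = make_pod_spec(image='daskdev/dask:latest',
                             threads_per_worker=1,
                             memory_limit='2G', memory_request='2G',
                             cpu_limit=1, cpu_request=1)

cluster = KubeCluster(threaded_spec)   # or process_spec, chosen per task
cluster.scale(4)                       # the count the analyst specified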

My Attempts: I initially considered three ways of setting up our infrastructure:

  1. Launch the Dask Helm chart and enable horizontal autoscaling by setting a metric to scale off CPU, as shown in articles like this one: https://levelup.gitconnected.com/deploy-and-scale-your-dask-cluster-with-kubernetes-9e43f7e24b04
  2. Launch the Dask Helm chart and use my database to keep a count of how many workers I need and how many are active (so a database push before and after each Dask process), then manually scale that way using client.cluster.scale() (see the sketch after this list). The problem is that workers are again not terminated gracefully, and a running task could be killed.
  3. Use Dask-Kubernetes as you've outlined in this post, as I try to work out below whether it's right for us.
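
The manual-scaling route in option 2 looks roughly like the sketch below today; the release name and worker count are placeholders:

from dask_kubernetes import HelmCluster
from dask.distributed import Client

# Attach to the scheduler that the Helm chart deployed (release name is an assumption).
cluster = HelmCluster(release_name="dask")
client = Client(cluster)

# Manual scaling works today: it changes the worker deployment's replica count.
cluster.scale(5)

# Adaptive scaling (cluster.adapt()) is not supported on HelmCluster yet;
# that gap is exactly what this issue is about.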

The Actual Question: I was wondering whether this is the right way to do it, picking up from where I left off with KubeCluster:

  • Code is sent to my Django Webserver inside of a Kubernetes pod
  • Create a new KubeCluster using a worker spec for that specific task; that way I can choose larger workers for more threads or smaller workers for more processes, using something like this:
from dask_kubernetes import KubeCluster, make_pod_spec

# Build a pod spec for this task's workers, then scale the cluster to the requested size.
pod_spec = make_pod_spec(image='daskdev/dask:latest',
                         memory_limit='4G', memory_request='4G',
                         cpu_limit=1, cpu_request=1,
                         env={'EXTRA_PIP_PACKAGES': 'fastparquet git+https://github.com/dask/distributed'})
cluster = KubeCluster(pod_spec)
cluster.scale(10)
  • Scale the KubeCluster to the amount of resources our analyst defined.
  • Let Google Kubernetes Engine handle scaling nodes to create space for the KubeCluster.
  • Close the KubeCluster by calling cluster.close() and client.close() when the task is done.
  • That way, we don't hand scaling off to Kubernetes, but keep it all within Dask (a rough end-to-end sketch follows below).
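
Put together, the per-task lifecycle described above might look roughly like this sketch; the pod sizes, worker count, and analyst_code callable are placeholders for whatever the analyst submitted:

from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

def run_task(analyst_code, n_workers):
    # Placeholder spec: in practice, build it from whatever the analyst requested.
    pod_spec = make_pod_spec(image='daskdev/dask:latest',
                             memory_limit='4G', memory_request='4G',
                             cpu_limit=1, cpu_request=1)

    # Context managers guarantee cluster.close()/client.close() run when the task is done,
    # so the worker pods are released and GKE can scale the nodes back down.
    with KubeCluster(pod_spec) as cluster:
        cluster.scale(n_workers)          # the count the analyst specified
        with Client(cluster) as client:
            future = client.submit(analyst_code)   # placeholder for the real submission
            return future.result()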

Will spread the love if this is answered and the last implementation I outlined turns out to be the way to go! If I wrote something confusing, I'll be more than happy to correct myself.
