Rearchitecture: Kubernetes Backend
In #195 the server architecture was redone to improve resiliency and performance and to make the Kubernetes backend more Kubernetes-native. That PR dropped support for Kubernetes entirely; support will be added back in a subsequent PR. Here I'll outline how I think the new Kubernetes backend should be architected.
When running on Kubernetes, `dask-gateway` will be composed of the following components:
- A dask-gateway CRD object describing the specification and state of a single dask cluster
- One or more instances of the `dask-gateway` api server, running behind a ClusterIP Service. These provide the internal api routing, but users won't (usually) connect directly to the api server.
- One or more instances of `traefik` configured to use the Kubernetes CRD provider (https://docs.traefik.io/providers/kubernetes-crd/). These will handle proxying both the scheduler and dashboard routes.
- One instance of a controller, managing dask-gateway CRD objects.
When initially installed, only a single `IngressRoute` object will be created, to expose the `dask-gateway` api server through the proxy. As dask clusters come and go, additional routes will be added and removed. Users will only communicate with the `traefik` proxy service directly; the remaining network traffic doesn't need to be exposed.
Walking through the components individually.
Dask-Gateway CRD
An instance of a Dask-Gateway CRD object needs to contain information on how to run a single dask cluster. Here’s a sketch of a proposed schema (in pseudocode):
```yaml
spec:
  schedulerTemplate: ... # A pod template for what the scheduler pod should look like
  workerTemplate: ...    # A pod template for what a single worker pod should look like
  ingressRoutes: ...     # A list of IngressRoute objects to create after the scheduler pod is running
  replicas: 0            # The number of workers to create, initially 0
status:
  # Status tracking goes here; not sure yet what we'd need to track
```
Each dask cluster created by the gateway will have one of these objects created.
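As a concrete sketch of what composing one of these objects might look like in the backend, here's a plain-dict version. The group/version/plural names (`gateway.dask.org`, `v1alpha1`, `daskclusters`) and the user label are assumptions for illustration, not a settled schema:

```python
# Sketch: composing a dask-gateway CRD object as a plain dict. The
# apiVersion, kind, and label names here are illustrative assumptions.

def make_cluster_object(name, username, scheduler_template, worker_template):
    """Compose the body for a DaskCluster custom object."""
    return {
        "apiVersion": "gateway.dask.org/v1alpha1",  # assumed group/version
        "kind": "DaskCluster",
        "metadata": {
            "name": name,
            # The username is stored as a label so clusters can later be
            # queried with a label selector.
            "labels": {"gateway.dask.org/user": username},
        },
        "spec": {
            "schedulerTemplate": scheduler_template,
            "workerTemplate": worker_template,
            "ingressRoutes": [],
            "replicas": 0,  # clusters start with no workers
        },
    }

obj = make_cluster_object("dask-abc123", "alice", {}, {})

# Submitting it with the official kubernetes Python client would look
# roughly like:
#
#   from kubernetes import client
#   client.CustomObjectsApi().create_namespaced_custom_object(
#       group="gateway.dask.org", version="v1alpha1",
#       namespace="dask-gateway", plural="daskclusters", body=obj,
#   )
```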
Backend Class
To plug in to the dask-gateway api server, we'd need to write a Kubernetes backend. We may need to tweak the interface exposed by the `dask_gateway_server.backends.base.Backend` class, but I think the existing interface is pretty close to what we'll want.
Walking through the methods:
`start_cluster`: This is called when a new cluster is requested for a user. It should validate the request, then compose and create the dask-gateway CRD object and a corresponding Secret (storing the TLS credentials and api token). It should return the object name (or some proxy for it) so that the cluster can be looked up again. Note that when this method returns, the cluster doesn't need to be running, only submitted.
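A couple of illustrative helpers for the pieces `start_cluster` would compose, assuming plain-dict objects; the name-generation scheme and the Secret key names (`dask.crt`, `dask.pem`, `api-token`) are my own placeholders, not from the design above:

```python
# Illustrative helpers for start_cluster: generate a unique cluster name
# and compose the Secret holding the TLS credentials and api token.
import base64
import os
import uuid


def new_cluster_name():
    """Generate a unique, DNS-safe object name for a cluster."""
    return "dask-" + uuid.uuid4().hex[:10]


def make_secret_object(name, tls_cert, tls_key, api_token):
    """Compose a Secret body; kubernetes expects base64-encoded data."""
    def b64(raw):
        return base64.b64encode(raw).decode()

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "data": {
            "dask.crt": b64(tls_cert),   # assumed key names
            "dask.pem": b64(tls_key),
            "api-token": b64(api_token),
        },
    }


secret = make_secret_object(new_cluster_name(), b"<cert>", b"<key>", os.urandom(32))
```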
`stop_cluster`: This is called when a cluster is requested to stop. It should do whatever is needed to stop the CRD object, and should be a no-op if the cluster is already stopped. As above, this doesn't need to wait for the cluster to stop, only for it to be stopping.
If we want to keep some job history then `stop_cluster` can't delete the CRD object, since that's the only record that the job happened. If we don't care about this then deleting the CRD is fine. Other backends keep around a short job history, but that may not matter here.
`get_cluster`: This is called when a user wants to look up or connect to a cluster. It's provided the cluster name (as returned by `start_cluster`), and should return a `dask_gateway_server.models.Cluster` object.
I think what we'll want is for each instance of the api server to watch for the objects required to compose the `Cluster` model (dask-gateway CRD objects, Secrets with a label selector, etc.). If the needed object is already reflected locally, we can use that. If it's not, we can try querying the k8s server in case the object was created and hasn't propagated to us yet (and then store it in the reflector locally). It should be fine to return slightly outdated information from this call.
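The cache-first lookup described above can be sketched as follows; the `Reflector` class and its method names are illustrative, not an existing dask-gateway API:

```python
# Sketch of a cache-first lookup: consult the local reflector (a dict
# kept up to date by a watch), and only on a miss fall back to querying
# the API server, storing the result locally for later lookups.

class Reflector:
    """A local cache of watched objects, keyed by object name."""

    def __init__(self, fetch_from_api):
        self.cache = {}
        self.fetch_from_api = fetch_from_api  # callable: name -> obj or None

    def put(self, name, obj):
        """Called by the watch machinery when an object changes."""
        self.cache[name] = obj

    def get(self, name):
        obj = self.cache.get(name)
        if obj is None:
            # The object may exist but not have propagated to our watch
            # yet, so ask the API server directly.
            obj = self.fetch_from_api(name)
            if obj is not None:
                self.cache[name] = obj  # future lookups stay local
        return obj

# Tiny usage example with a stubbed API server:
api_objects = {"dask-abc": {"status": "RUNNING"}}
r = Reflector(api_objects.get)
found = r.get("dask-abc")   # cache miss, falls back to the stubbed API
cached = "dask-abc" in r.cache
```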
`list_clusters`: This is used to query clusters, optionally filtering on username and statuses. Like `get_cluster` above, this can be slightly out of date, so I think returning only the reflected view is fine. If not, this should map pretty easily onto a k8s query with a label selector (a cluster's `username` will be stored in a label).
`on_cluster_heartbeat`: This responds to messages from the cluster to the gateway. For the k8s backend, we only really need to handle messages that contain changes to the cluster worker count. These translate to patch updates on the CRD objects, updating the `replicas` field.
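Such an update is a natural fit for a merge patch that touches only `spec.replicas`. A minimal sketch, reusing the assumed group/version/plural names from earlier:

```python
# Sketch: translating a heartbeat's desired worker count into a merge
# patch on the CRD's spec.replicas field.

def replicas_patch(count):
    """Body of a merge patch that updates only spec.replicas."""
    return {"spec": {"replicas": count}}

patch = replicas_patch(5)

# With the official kubernetes Python client, applying it would look
# roughly like:
#
#   client.CustomObjectsApi().patch_namespaced_custom_object(
#       group="gateway.dask.org", version="v1alpha1",
#       namespace="dask-gateway", plural="daskclusters",
#       name=cluster_name, body=patch,
#   )
```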
Controller
The controller handles the bulk of the logic around when/how to run pods. It should watch for updates to the CRD objects, and update the child resources (pods, …) accordingly. Walking through the logic:
- When a new CRD is created, the controller will create a scheduler pod from the scheduler template, along with any `IngressRoute` objects specified in the CRD. The scheduler pod should have the CRD as an owner reference so it'll be automatically cleaned up on CRD deletion. Likewise, the `IngressRoute` objects should have the scheduler pod as an owner reference.
- When the scheduler pod is running, the controller will transition the CRD status to RUNNING. This signals to the api server that the cluster is now available.
- When the `replicas` value is higher than the current number of child workers (a scale-up event), additional worker pods will be created. The worker pods should be created with the scheduler pod as an owner reference so that they'll be automatically cleaned up.
- When the `replicas` value is lower than the current number of child workers, the controller should do nothing. All worker pods should be configured with `restartPolicy: OnFailure`. When a cluster is scaling down, the workers will exit successfully and the pods will stop themselves.
- When the CRD is signaled to shut down, the scheduler pod should be deleted, which will delete the worker pods as well. We'll also need to delete the corresponding Secret. Perhaps `start_cluster` sets the CRD as the owner reference of the Secret, and the controller is robust to CRDs being created before their corresponding Secret? Not sure.
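The scale-related branches of this logic can be sketched as a small, synchronous reconcile function; the function and parameter names are illustrative, and actual pod creation is abstracted behind a callable:

```python
# Minimal sketch of the controller's reconcile step for worker counts.
# Only the scale-up branch acts; scale-down is deliberately a no-op,
# since workers exit cleanly on their own and restartPolicy: OnFailure
# keeps their pods from being restarted.

def reconcile(crd, current_workers, create_worker_pod):
    """Bring child worker pods in line with the CRD's spec.replicas.

    crd: the DaskCluster object (a dict)
    current_workers: number of worker pods currently existing
    create_worker_pod: callable that creates one pod from a template
    """
    desired = crd["spec"]["replicas"]
    if desired > current_workers:
        # Scale up: create the missing workers from the worker template.
        for _ in range(desired - current_workers):
            create_worker_pod(crd["spec"]["workerTemplate"])
    # Scale down: do nothing -- the scheduler retires workers itself.

created = []
crd = {"spec": {"replicas": 3, "workerTemplate": {}}}
reconcile(crd, 1, lambda template: created.append(template))
```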
We want to support running clusters in multiple namespaces in the same gateway. I'm not sure what the performance implications of watching all namespaces are versus a single namespace; if they're significant we may want to make this configurable, otherwise we may always watch all namespaces even when we only deploy in a single one.
We’ll also want to be resilient to pod creation being denied due to a Resource Quota.
I’ve written a controller with the Python k8s api and it was pretty simple, but I’d be happy writing this using operator-sdk or kubebuilder as well. No strong opinions.
Since this splits up the kubernetes implementation into a couple of smaller services, any meaningful testing will likely require all of them running. I’m not familiar with the tooling people use for this (see #196), advice here would be useful. We can definitely write unit tests for the smaller components, but an integration test will require everything running.
Issue Analytics
- State:
- Created: 4 years ago
- Comments: 35 (35 by maintainers)
Top GitHub Comments
In this case `spec.replicas` would be `8`, and `status.replicas` would be `10`. Since the controller never responds to scale-down events, and `status.replicas > spec.replicas`, there's nothing to do.

Hmmm, that's interesting. I still think we'd be better served by managing the pods ourselves directly though. While the client-side operations you've done above may happen quickly, I suspect the kubernetes core bits managing the ReplicaSet being created/deleted in quick succession will have some measurable overhead compared to a more custom solution. That's a clever solution though, wouldn't have thought of that.
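The decision rule in that comment reduces to a one-liner, shown here with hypothetical field names matching the `spec`/`status` split discussed above:

```python
# The scale decision in code: the controller compares the desired worker
# count (spec.replicas) with the observed one (status.replicas) and only
# ever acts on scale-ups; scale-downs need no controller action.

def workers_to_create(spec_replicas, status_replicas):
    """Number of worker pods the controller should create (never < 0)."""
    return max(0, spec_replicas - status_replicas)
```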