
After a few months of work on this project, a few past design decisions are starting to cause issues.

  • #148: we’re currently limited to a single-server model, with the same restrictions as JupyterHub (much of the design is based on JupyterHub).
  • Even though we can support a database for persistence, setting up the extra infrastructure places a burden on administrators, and people are unlikely to do so unless necessary.

At the same time, the external-facing API seems fairly solid and useful. I’ve been trying to find a way to rearchitect so we can run (but not mandate) multiple gateway instances, without a mandatory database to synchronize them, and I think I’ve reached a decent design.

Design Requirements

  • We must be able to support all major backends using the same codebase (as much as possible)
    • Kubernetes
    • Hadoop
    • HPC JobQueues
    • Local processes
    • Extension points for custom backends
  • We must keep install instructions simple. A basic local install must not require installing anything that can’t be installed from PyPI/conda-forge.
  • Performance of a single server instance must not be overly degraded by these changes. We’ve optimized for the single-server model so far, so I expect to lose some performance here, but it’s worth keeping this use case in mind.
  • Where possible, additional infrastructure (e.g. a database) should not be required. If we can rely on systems already present in each backend, we should.
  • When properly configured, each backend should allow the gateway to run without a single point of failure (at least none written by us). I don’t see HA increasing scalability for most deployments, but I do see it improving robustness.

Below I present a design outline of how I hope to achieve these goals for the various backends.

Authentication

As with JupyterHub, we have an Authenticator class that configures authentication for the gateway server. If a request comes in without a cookie/token, the authenticator is used to authenticate the user, then a cookie is added to the response so the authenticator won’t be called again. The decrypted cookie is a uuid that maps back to a user row in our backing database.

I propose keeping the authenticator class much the same, but using a jwt instead of a uuid cookie to store the user information. If a request contains a jwt, it’s validated, and (if valid) information like the user name and groups can be extracted from the token. This removes the need for a shared database mapping cookies to user names, since subsequent requests already carry this information.
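As a rough illustration, minting and validating such a token might look like the following. This is a minimal sketch assuming PyJWT and an HS256 shared secret; the library choice, claim names, and key handling are all assumptions rather than a settled design.

import jwt  # PyJWT; any jwt library would work here

SECRET = "secret-loaded-from-gateway-config"  # hypothetical config value

def make_token(user_name, groups):
    # Mint a jwt carrying the user info we previously looked up in the
    # database via the uuid cookie.
    claims = {"name": user_name, "groups": groups}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def user_from_token(token):
    # Validate the jwt and extract the user info. Raises
    # jwt.InvalidTokenError if the token is invalid or tampered with.
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    return claims["name"], claims["groups"]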

Cluster IDs

Currently we mint a new uuid for every created cluster. This is nice because all backends share the same cluster naming scheme, but it means we need a database to map uuids to cluster backend information (e.g. pod name, etc.).

To remove the need for a database, I propose encoding backend information in the cluster id. Each backend will then have a different-looking id, but we can parse the cluster id to reference the cluster directly in the backing resource manager, instead of using a database to map between our id and the backend’s (a sketch follows the list below).

For the following backends, things might look like:

  • Kubernetes: Cluster CRD object name
  • Hadoop: application id
  • HPC job queues/local processes/etc.: requires a database, probably the same id scheme as now.
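To make this concrete, here’s a minimal sketch for the kubernetes case; the delimiter and the exact fields packed into the id are assumptions, not a final format.

def encode_cluster_id(namespace, crd_name):
    # Pack the namespace and CRD object name into the cluster id,
    # e.g. ("dask-gateway", "dask-abc123") -> "dask-gateway.dask-abc123"
    return f"{namespace}.{crd_name}"

def decode_cluster_id(cluster_id):
    # Recover enough information to reference the cluster in the resource
    # manager directly, with no database lookup in between.
    namespace, _, crd_name = cluster_id.partition(".")
    return namespace, crd_name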

Cluster Managers

To support multiple dask-gateway servers, I found it helpful to split our backends into two categories:

Useful database in the resource manager

  • Kubernetes
  • YARN

No useful database in the resource manager

  • Local processes
  • HPC Job queues

The former category could support all our needed functionality without any need for synchronization outside of requests to the resource manager. The latter would require additional infrastructure on our end if we wanted to run multiple instances of the gateway server.

Walking through my proposed ideal implementations for each backend:

Kubernetes

The proposed design for running dask-gateway on kubernetes in an ideal (IMO) deployment is as follows:

  • A dask-gateway deployment, containing one or more instances of our gateway server.
  • A traefik proxy deployment, containing one or more pods running traefik, configured with the IngressRoute provider. Traefik 2.0 added support for proxying TCP connections dispatched on SNI, which covers everything we needed from our own custom proxy implementation. The old proxy could still be used, but it doesn’t (currently) support multiple instances; traefik handles all of this for us in a scalable manner.
  • A CRD for a dask cluster, and a backing controller (probably written with metacontroller). Making dask clusters a kubernetes resource lets us use kubernetes itself to store all our application state, removing the need for an external database. It also removes the need to synchronize our application state with the resource manager’s state, since in this case they’re identical. Operations like scaling a cluster become just a patch on the CRD. The metacontroller webhooks could run in the dask-gateway-server pods, or in their own pods; right now I’m leaning towards the former for simplicity.

Listing clusters, adding new clusters, removing old clusters, etc. can all be done with single kubernetes API requests, removing much of the need for synchronization on our end. It also means admins can use kubectl without worrying about corrupting a deployment’s internal state.
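As an illustration of how thin this makes the gateway itself, scaling a cluster could reduce to a single patch on the CRD via the official kubernetes Python client. The group/version/plural names below are hypothetical placeholders; the real CRD coordinates aren’t decided yet.

from kubernetes import client, config

config.load_incluster_config()  # assumes the gateway runs inside the cluster
api = client.CustomObjectsApi()

# Scale a cluster to 10 workers by patching its (hypothetical) CRD object;
# the controller observes the change and reconciles the actual pods.
api.patch_namespaced_custom_object(
    group="gateway.dask.org",
    version="v1alpha1",
    namespace="dask-gateway",
    plural="daskclusters",
    name="dask-abc123",
    body={"spec": {"workers": 10}},
)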

Hadoop

The Hadoop resource manager could also be used to track all the application state we care about, but I’m not sure querying it will be performant enough for us. We likely want to support an optional (or maybe mandatory) database here; some benchmarking will be needed before implementation.

In this case, an installation would contain:

  • One or more dask-gateway server instances, behind a load balancer (load balancer not needed if only running one instance).
  • Our own existing custom proxy, or traefik. If using traefik we could maybe make use of JupyterHub’s work to make it configurable via a file provider.
  • Management of individual clusters will move to skein’s application master, instead of the gateway server. This works similarly to the kubernetes version above: scaling a cluster forwards the request to the application master, and querying the current scale queries the application master. This will likely increase latency (every request has to ping an external service), but it removes the gateway server as the bottleneck for handling everything (sketched below).
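A rough sketch of that forwarding with skein; the worker service name and exact client calls are assumptions based on skein’s API, and in practice the skein client would be shared rather than created per request.

import skein

def scale_yarn_cluster(app_id, n):
    # The YARN application id comes straight from the cluster id, so no
    # database lookup is needed to locate the application master.
    client = skein.Client()
    app = client.connect(app_id)  # RPC connection to the application master
    app.scale("dask.worker", count=n)  # service name is an assumption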

HPC job queue systems

HPC job queue systems will require an external database to synchronize multiple dask-gateway servers when running in HA mode. With some small tweaks, this will look much like the existing implementation. I plan to rely on Postgres for this, making use of its SKIP LOCKED feature to implement a work queue for synchronizing spawners, and asyncpg for its fast postgres integration.
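A rough sketch of that work-queue pattern with asyncpg; the table layout and column names are illustrative only.

import asyncpg

# pool = await asyncpg.create_pool("postgresql://...")  # created at startup

async def claim_next_task(pool):
    # Claim the next unhandled task. FOR UPDATE SKIP LOCKED makes concurrent
    # gateway instances skip rows another instance has already locked rather
    # than block on them - exactly the work-queue behavior we want.
    async with pool.acquire() as conn:
        async with conn.transaction():
            row = await conn.fetchrow(
                """SELECT id, payload FROM tasks
                   WHERE done = false
                   ORDER BY id
                   LIMIT 1
                   FOR UPDATE SKIP LOCKED"""
            )
            if row is None:
                return None  # queue is empty (or fully claimed)
            await conn.execute(
                "UPDATE tasks SET done = true WHERE id = $1", row["id"]
            )
            return row["payload"]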

Cluster backend base class

Currently we define a base ClusterManager class that each backend has to implement. Each dask cluster gets its own ClusterManager instance, which manages starting/stopping/scaling the cluster.

With the plan described above, we’ll need to move the backend-specific abstractions higher up the stack. I propose the following initial definition (this will likely change as I start working on things):

class ClusterBackend:
    async def list_clusters(self, user=None):
        """List all running or pending clusters.

        Parameters
        ----------
        user : str, optional
            If provided, filter on this user name.
        """
        raise NotImplementedError

    async def get_cluster(self, user, cluster_id):
        """Get information about a cluster.

        Same output as `list_clusters`, but for a single cluster."""
        raise NotImplementedError

    async def start_cluster(self, user, **cluster_options):
        """Start a new cluster for this user."""
        raise NotImplementedError

    async def stop_cluster(self, user, cluster_id):
        """Stop a cluster.

        No-op if already stopped or nonexistent.
        """
        raise NotImplementedError

    async def scale_cluster(self, user, cluster_id, n):
        """Scale a cluster to `n` workers."""
        raise NotImplementedError

Backends like kubernetes (and maybe hadoop) would implement the ClusterBackend class directly. We’d also provide an implementation that uses a database to manage the multi-user/cluster state and abstracts over a single-cluster class (probably looking much the same as our existing ClusterManager base class).

from traitlets import Type

class DatabaseClusterBackendBase(ClusterBackend):
    """A backend class that uses a database to synchronize between users/clusters.

    HPC job queue backends would use this, as well as any other backend
    lacking a sufficiently queryable resource manager."""

    cluster_manager_class = Type(
        klass=ClusterManager,
        help="The cluster manager class used to manage individual clusters",
    )
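To make the layering concrete, a backend built on this base would only need to supply its per-cluster manager. A hypothetical sketch (SlurmClusterManager is an illustrative stand-in, not an existing class):

class SlurmClusterManager(ClusterManager):
    """Illustrative stand-in; would implement start/stop/scale for a
    single Slurm job via the existing per-cluster interface."""

class SlurmClusterBackend(DatabaseClusterBackendBase):
    """Hypothetical HPC backend: per-cluster logic lives in the manager,
    while the base class handles users, the database, and the work queue."""

    cluster_manager_class = Type(
        klass=SlurmClusterManager,
        help="Manages a single Slurm-backed dask cluster",
    )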


Top GitHub Comments

droctothorpe commented, Feb 7, 2020

Having dug a little deeper into the Traefik documentation, @jacobtomlinson, my concerns were unwarranted. Deleting the comment. Thanks for the appropriate pushback.

rokroskar commented, Jul 5, 2021

I was looking for a possible way to set up authentication with JWTs and found this issue, but no other mention of JWT-based authentication. Is this work on-going somewhere?
