Redesign Cluster Managers

Over the past few months we’ve learned more about the requirements for deploying Dask on different cluster resource managers. This has led to several projects like dask-yarn, dask-jobqueue, and dask-kubernetes. These projects share some code within this repository, but also add their own constraints. Now that we have more experience, this might be a good time to redesign things, both in the external projects and in the central API.

There are a few changes that have been proposed that likely affect everyone:

  1. We would like to separate the Cluster manager object from the Scheduler in some cases
  2. We would like to optionally change how the user specifies cluster size to incorporate things like number of cores or amount of memory. In the future this might also expand to other resources like GPUs.
  3. We would like to build more UI elements around the cluster object, like JupyterLab server extensions

The first two will likely require synchronized changes to all of the downstream projects.

Separate ClusterManager, Scheduler, and Client

From here on I’m going to call the object that starts and stops workers, like KubeCluster, the ClusterManager.

Previously the ClusterManager, Scheduler, and Client were all started from within a notebook, so they lived in the same process and could act on the same event loop thread. It didn’t matter much which part did which action. We’re now considering separating all of these. This forces us to think about the kinds of communication they’ll have to do, and where certain decisions and actions take place. At first glance we might assign actions like the following (a rough code sketch follows the list):

  1. ClusterManager
    • Start new workers
    • Force-kill old workers
    • Have a UI that allows users to specify a desired makeup of workers, either a specific number or an Adaptive range. I’m going to call this the Worker Spec
  2. Scheduler
    • Maintain Adaptive logic, and determine how many workers should exist given current load
    • Choose which workers to remove and retire them politely
  3. Client: normal operation; I don’t think anything needs to change here.
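
As a rough sketch only, the ClusterManager side of that split might look something like the code below. Everything here is hypothetical and for illustration: WorkerSpec, start_worker, stop_worker, and _send_spec_to_scheduler are invented names, not an existing Dask API.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WorkerSpec:
    """What the user asked for: a fixed count or an adaptive range."""
    count: Optional[int] = None                         # e.g. cluster.scale(10)
    adaptive_range: Optional[Tuple[int, int]] = None    # e.g. cluster.adapt(0, 20)

class ClusterManager:
    """Starts and force-kills workers; knows nothing about task scheduling."""

    def __init__(self, scheduler_address):
        self.scheduler_address = scheduler_address
        self.spec = WorkerSpec()
        self.workers = {}            # worker name -> resource-manager handle

    def scale(self, n):
        # Record the Worker Spec and forward it to the scheduler
        self.spec = WorkerSpec(count=n)
        self._send_spec_to_scheduler()

    def adapt(self, minimum, maximum):
        self.spec = WorkerSpec(adaptive_range=(minimum, maximum))
        self._send_spec_to_scheduler()

    def start_worker(self):
        """Ask the resource manager (Kubernetes, YARN, a job queue) for a new worker."""
        raise NotImplementedError    # backend-specific (KubeCluster, YarnCluster, ...)

    def stop_worker(self, name):
        """Force-kill a worker after the scheduler has retired it politely."""
        raise NotImplementedError    # backend-specific

    def _send_spec_to_scheduler(self):
        """Send the current Worker Spec to the scheduler (transport not shown)."""
        raise NotImplementedError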

Some of these actions depend on information from the others. In particular (see the message sketch after this list):

  • The ClusterManager needs to send the following to the Scheduler
    • The Worker Spec (like amount of desired memory or adaptive range), whenever it changes
  • The Scheduler needs to send the following to the ClusterManager
    • The desired target number of workers given current load (if in adaptive mode)
    • An announcement every time a new worker arrives (to manage the difference between launched/pending and actively running workers)
    • Requests to force-kill and clean up workers after it has politely retired them
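
Purely as a guess at what that traffic could look like, the protocol might reduce to a handful of small messages. The operation names below (set_worker_spec, target_workers, worker_arrived, release_worker) are invented for illustration and are not an existing distributed API.

# ClusterManager -> Scheduler: the Worker Spec, whenever it changes
{"op": "set_worker_spec", "memory": "1TB"}
{"op": "set_worker_spec", "adaptive_range": [0, 20]}

# Scheduler -> ClusterManager: desired target, given current load (adaptive mode)
{"op": "target_workers", "n": 12}

# Scheduler -> ClusterManager: a new worker has connected and is running
{"op": "worker_arrived", "name": "worker-7", "address": "tcp://10.0.0.7:40123"}

# Scheduler -> ClusterManager: worker-3 has been retired politely, force-kill it
{"op": "release_worker", "name": "worker-3"}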

This is all just a guess. I suspect that other things might arise when actually trying to build everything here.

How to specify cluster size

This came up in https://github.com/dask/distributed/issues/2208, raised by @guillaumeeb. Today we ask users for a number of desired workers:

cluster.scale(10)

But we might instead want to allow users to specify an amount of desired memory or cores:

cluster.scale('1TB')

And in the future people will probably also ask for more complex compositions, like some small workers and some big workers, some with GPUs, and so on.
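
One plausible way to support both forms (purely a sketch, not the actual API) is to translate a memory string into a worker count using the per-worker memory that the ClusterManager already knows about. dask.utils.parse_bytes exists today; the scale method and the worker_memory attribute below are assumptions for illustration.

from math import ceil
from dask.utils import parse_bytes

def scale(self, n_or_memory):
    """Hypothetical scale() accepting either a worker count or a memory string."""
    if isinstance(n_or_memory, str):
        target = parse_bytes(n_or_memory)             # "1TB" -> 1_000_000_000_000 bytes
        per_worker = parse_bytes(self.worker_memory)  # e.g. "16GB" per worker
        n = ceil(target / per_worker)
    else:
        n = n_or_memory
    self.spec = WorkerSpec(count=n)                   # WorkerSpec from the earlier sketch
    self._send_spec_to_scheduler()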

If we now have to establish a communication protocol between the ClusterManager and the Scheduler/Adaptive, then it might be an interesting challenge to make that future-proof.
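
As one guess at a future-proof shape, the Worker Spec could be a list of homogeneous worker groups rather than a single number; the keys below are hypothetical.

# Hypothetical composite Worker Spec: a list of homogeneous worker groups
worker_spec = [
    {"count": 10, "memory": "16GB", "cores": 4},              # small workers
    {"count": 2, "memory": "128GB", "cores": 16, "gpus": 1},  # big GPU workers
]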

Further discussion on this topic should probably remain in https://github.com/dask/distributed/issues/2208 , but I wanted to mention it here.

Shared UI

@ian-r-rose has done some excellent work on the JupyterLab extension for Dask. He has also mentioned thoughts on how to include the ClusterManager within something like that as a server extension. This would optionally move the ClusterManager outside of the notebook and into the JupyterLab sidebar. A number of people have asked for this in the past. If we’re going to redesign things, I thought we should also include UI in that process to make sure we capture any constraints it adds.

Organization

Should the ClusterManager object continue to be a part of this repository or should it be spun out? There are probably costs and benefits both ways.

cc @jcrist (dask-yarn) @jacobtomlinson (dask-kubernetes) @jhamman @guillaumeeb @lesteve (dask-jobqueue) @ian-r-rose (dask-labextension) for feedback.

Top GitHub Comments

mrocklin commented on Sep 19, 2018:

@jhamman some reasons to separate the scheduler from the “thing that controls starting and stopping workers”:

  1. You want to separate the lifecycle of the dask cluster from the lifecycle of a notebook kernel session. For example you might want to manage the cluster with a JupyterLab plugin separately from the notebook.
  2. You are using Jupyter from a place that is not on your normal computational node pool (like a login node). You want your scheduler and workers to be close to each other on the network of computational nodes, even if your client is somewhat further away.

mrocklin commented on Jun 4, 2020:

Yup. Thanks for flagging @jacobtomlinson
