Resource Manager Interface
The first step is hard, but standing up is harder.
Deploying distributed on a remote cluster is a large hurdle between the curious user and serious adoption. Efforts to improve deployment can significantly improve community uptake and value.
Unfortunately, there are a large number of deployment solutions, so there is no single piece of code we can write that satisfies most potential users. We have direct relations with people who employ the following deployment technologies:
- YARN
- Mesos
- SLURM
- Torque
- SGE
- SSH
- Local processes
Supporting each of these is doable in isolation, but supporting all of them in aggregate calls for a common framework.
Additionally, near-future versions of the distributed scheduler may want to dynamically request and release worker resources depending on load. Providing a single resource management interface to the scheduler would allow integration with several technologies in a sane and maintainable manner.
Question: What does such a resource management interface look like?
Example Interface
Let us consider the following signals from scheduler to resource manager (RM) and back.
- Scheduler -> RM: Please give me more workers
- Scheduler -> RM: I no longer need these workers
- RM -> Scheduler: I plan to take back this worker very soon
- RM -> Scheduler: This worker has unexpectedly died
We could implement this as a Python ResourceManager object that has functions for “Please give me more workers” and “I no longer need these workers” that the scheduler can call, as well as callbacks, provided by the scheduler at ResourceManager creation, for “I plan to take back this worker soon” and “this worker has unexpectedly died”.
```python
class ResourceManager(object):
    def __init__(self, on_reclaiming_worker, on_worker_died, **kwargs):
        ...

    def request_workers(self, nworkers, **kwargs):
        ...

    def release_workers(self, worker_ids):
        ...
```
We might then subclass this interface for local processes, SSH, Mesos, YARN, etc.
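As a rough illustration of such a subclass, here is a minimal sketch of a local-process backend. The class name, the default worker command, and the `check_workers` polling method are all assumptions for the sake of the example, not part of any existing API:

```python
import subprocess
import sys
import uuid


class LocalProcessResourceManager:
    """Hypothetical ResourceManager backed by local worker processes.

    ``worker_cmd`` is a placeholder for whatever command starts a real
    worker; by default it just sleeps so the sketch is runnable.
    """

    def __init__(self, on_reclaiming_worker, on_worker_died,
                 worker_cmd=(sys.executable, "-c", "import time; time.sleep(60)")):
        self.on_reclaiming_worker = on_reclaiming_worker
        self.on_worker_died = on_worker_died
        self.worker_cmd = list(worker_cmd)
        self.workers = {}  # worker_id -> subprocess.Popen

    def request_workers(self, nworkers, **kwargs):
        """Scheduler -> RM: please give me more workers."""
        new_ids = []
        for _ in range(nworkers):
            worker_id = str(uuid.uuid4())
            self.workers[worker_id] = subprocess.Popen(self.worker_cmd)
            new_ids.append(worker_id)
        return new_ids

    def release_workers(self, worker_ids):
        """Scheduler -> RM: I no longer need these workers."""
        for worker_id in worker_ids:
            proc = self.workers.pop(worker_id, None)
            if proc is not None:
                proc.terminate()
                proc.wait()

    def check_workers(self):
        """Poll for workers that died unexpectedly and fire the callback."""
        for worker_id, proc in list(self.workers.items()):
            if proc.poll() is not None:  # process has exited
                del self.workers[worker_id]
                self.on_worker_died(worker_id)
```

A production backend would watch workers asynchronously rather than polling, but the shape of the interface stays the same: two calls inward, two callbacks outward.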
Questions
Different resource management systems employ different information. YARN allows the specification of CPUs and memory and the use of containers, HPC job schedulers impose fixed job durations, and Mesos operates a bit differently from both.
The example operations provided above may not fit all resource managers. Some may provide only a subset of this functionality while others may have a superset. How do we balance the need to specialize to a particular system against maintaining compatibility with many systems and keeping the code simple?
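One possible compromise, sketched below purely as an illustration: the base class advertises a capability set, and the scheduler consults it before relying on optional knobs. The `capabilities` attribute, the `YarnResourceManager` class, and the `ncores`/`memory` parameters are hypothetical names invented for this sketch:

```python
class ResourceManager:
    # Hypothetical capability set; each subclass advertises what it supports.
    capabilities = frozenset()

    def request_workers(self, nworkers, **kwargs):
        raise NotImplementedError

    def release_workers(self, worker_ids):
        raise NotImplementedError


class YarnResourceManager(ResourceManager):
    # YARN-style backends can size requests by CPU and memory (illustrative).
    capabilities = frozenset({"cpu_limits", "memory_limits"})

    def request_workers(self, nworkers, ncores=1, memory="1 GB", **kwargs):
        # Would translate into container requests; here it just returns
        # placeholder worker ids so the sketch is runnable.
        return ["yarn-worker-%d" % i for i in range(nworkers)]

    def release_workers(self, worker_ids):
        pass  # would release the corresponding containers


def scale_up(rm, n):
    """Scheduler-side helper: use optional knobs only if the RM advertises them."""
    if "cpu_limits" in rm.capabilities:
        return rm.request_workers(n, ncores=2)
    return rm.request_workers(n)
```

This keeps the common code simple (one call site in the scheduler) while letting each backend expose only what its system actually supports; the cost is that optional features degrade silently rather than failing loudly.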
Issue Analytics
- Created 8 years ago
- Reactions: 7
- Comments: 17 (14 by maintainers)
Top GitHub Comments
Subscribing to the discussion…
Support for SLURM and SGE can be implemented together by reusing the lightweight abstraction provided by https://github.com/clusterlib/clusterlib.