Resource Manager Interface

The first step is hard, but standing up is harder.

Deploying distributed on a remote cluster is a large hurdle between the curious user and serious adoption. Efforts to improve deployment can significantly improve community uptake and value.

Unfortunately, there are many deployment solutions, so no single piece of code can satisfy most potential users. We have direct relations with people who employ the following deployment technologies:

  1. YARN
  2. Mesos
  3. SLURM
  4. Torque
  5. SGE
  6. SSH
  7. Local processes

Supporting each of these is doable in isolation, but supporting all of them together calls for a common framework.

Additionally, near-future versions of the distributed scheduler may want to dynamically request and release worker resources depending on load. Providing a single resource management interface to the scheduler would allow integration with several technologies in a sane and maintainable manner.
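As a rough illustration of load-dependent scaling, the sketch below computes how many workers a scheduler might request or release. The function name `workers_delta`, the `tasks_per_worker` ratio, and the min/max bounds are all assumptions made for this example; none of them come from the distributed scheduler itself.

```python
def workers_delta(pending_tasks, current_workers, tasks_per_worker=4,
                  min_workers=1, max_workers=100):
    """Return how many workers to add (positive) or release (negative).

    Hypothetical policy: aim for one worker per `tasks_per_worker`
    pending tasks, clamped to the range [min_workers, max_workers].
    """
    # Ceiling division without importing math.
    wanted = -(-pending_tasks // tasks_per_worker)
    wanted = max(min_workers, min(max_workers, wanted))
    return wanted - current_workers


grow = workers_delta(pending_tasks=40, current_workers=2)
shrink = workers_delta(pending_tasks=0, current_workers=5)
```

A positive result would translate into a "please give me more workers" request and a negative one into an "I no longer need these workers" release, whatever backend sits behind the interface.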

Question: What does such a resource management interface look like?

Example Interface

Let us consider the following signals from scheduler to resource manager (RM) and back.

  • Scheduler -> RM: Please give me more workers
  • Scheduler -> RM: I no longer need these workers
  • RM -> Scheduler: I plan to take back this worker very soon
  • RM -> Scheduler: This worker has unexpectedly died

We could implement this as a Python ResourceManager object with methods for “Please give me more workers” and “I no longer need these workers” that the scheduler can call, plus callbacks that the scheduler provides when it creates the ResourceManager for “I plan to take back this worker very soon” and “This worker has unexpectedly died”.

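class ResourceManager(object):
    # The scheduler supplies both callbacks when it creates the manager.
    def __init__(self, on_reclaiming_worker, on_worker_died, **kwargs):
        ...

    # Scheduler -> RM: please give me more workers.
    def request_workers(self, nworkers, **kwargs):
        ...

    # Scheduler -> RM: I no longer need these workers.
    def release_workers(self, worker_ids):
        ...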

We might then subclass this interface for local processes, SSH, Mesos, YARN, and so on.
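To make the subclassing idea concrete, here is a minimal sketch of one possible backend. It only tracks worker ids in memory and starts no real processes. The class `InMemoryResourceManager`, the `worker-N` id format, the list returned by `request_workers`, and the two `simulate_*` helpers are hypothetical and exist only for illustration. The base class is repeated so the block runs on its own.

```python
import itertools


class ResourceManager(object):
    """Interface from the proposal, repeated so the sketch is self-contained."""

    def __init__(self, on_reclaiming_worker, on_worker_died, **kwargs):
        self.on_reclaiming_worker = on_reclaiming_worker
        self.on_worker_died = on_worker_died

    def request_workers(self, nworkers, **kwargs):
        raise NotImplementedError

    def release_workers(self, worker_ids):
        raise NotImplementedError


class InMemoryResourceManager(ResourceManager):
    """Hypothetical backend that records worker ids and launches nothing."""

    def __init__(self, on_reclaiming_worker, on_worker_died, **kwargs):
        super(InMemoryResourceManager, self).__init__(
            on_reclaiming_worker, on_worker_died, **kwargs)
        self._ids = itertools.count()
        self.workers = set()

    def request_workers(self, nworkers, **kwargs):
        # Assumption: the call returns the ids of the new workers.
        new_ids = ["worker-%d" % next(self._ids) for _ in range(nworkers)]
        self.workers.update(new_ids)
        return new_ids

    def release_workers(self, worker_ids):
        self.workers.difference_update(worker_ids)

    def simulate_reclaim(self, worker_id):
        # RM -> Scheduler: I plan to take back this worker very soon.
        self.on_reclaiming_worker(worker_id)

    def simulate_failure(self, worker_id):
        # RM -> Scheduler: this worker has unexpectedly died.
        self.workers.discard(worker_id)
        self.on_worker_died(worker_id)


# The scheduler side is stood in for by two plain lists.
reclaiming, died = [], []
rm = InMemoryResourceManager(reclaiming.append, died.append)
ids = rm.request_workers(3)
rm.release_workers(ids[:1])
rm.simulate_reclaim(ids[2])
rm.simulate_failure(ids[1])
```

A real backend would replace the bookkeeping in `request_workers` and `release_workers` with calls to its cluster system and would fire the callbacks from whatever notifications that system provides.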

Questions

Different resource management systems employ different information. YARN allows the specification of CPUs and memory and the use of containers; HPC job schedulers have fixed job durations; Mesos operates a bit differently from all of them.

The example operations provided above may not fit all resource managers. Some may provide only a subset of this functionality, while others may provide a superset. How do we balance the need to specialize for a particular system with maintaining compatibility across many systems and keeping the code simple?
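One possible direction, offered only as a sketch and not as a settled answer to the question above, is to let each backend declare which optional features it understands so that callers can check before relying on them. The `capabilities` attribute, the `supports` and `check_options` methods, and the two example subclasses are hypothetical names invented for this illustration.

```python
class ResourceManager(object):
    # Names of the optional features a backend understands (hypothetical).
    capabilities = frozenset()

    def supports(self, feature):
        return feature in self.capabilities

    def check_options(self, **options):
        """Raise early if the caller asks for something this backend lacks."""
        unknown = sorted(set(options) - self.capabilities)
        if unknown:
            raise ValueError("unsupported options: %s" % ", ".join(unknown))


class YarnLikeManager(ResourceManager):
    # Assumption: a YARN-style backend accepts CPU and memory requests.
    capabilities = frozenset(["cpus", "memory"])


class BatchLikeManager(ResourceManager):
    # Assumption: an HPC batch backend accepts a fixed job duration.
    capabilities = frozenset(["walltime"])


yarn_has_memory = YarnLikeManager().supports("memory")
batch_has_memory = BatchLikeManager().supports("memory")
```

The shared interface stays small under this approach, and system-specific options travel through `**kwargs` only after a check like `check_options` has accepted them.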

cc @hussainsultan @broxtronix @quasiben @danielfrg

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Reactions: 7
  • Comments: 17 (14 by maintainers)

Top GitHub Comments

ellisonbg commented, Feb 19, 2016 (2 reactions)

Subscribing to the discussion…

ogrisel commented, Mar 15, 2016 (1 reaction)

Support for SLURM and SGE can be implemented at once by reusing the lightweight abstraction provided by https://github.com/clusterlib/clusterlib.
