
[Feature] Feedback on Ray Job API


Hi everyone,

I’ve been using the Ray Job API and ray up as part of my work building a Ray scheduler for torchX.

I’ll preface the below by saying that I think the Ray Job API is a great feature, but it needs a bit more work to make it as great to use as the rest of Ray.

I have some feedback to improve its usability, plus one major limitation I wanted to point out, which @amogkam and @jiaodong suggested I share here.

Major limitation

Right now it seems like working_dir has a 100MB limit, which is problematic because it means I don’t have a clean story for loading training data, binaries, and extra files like label key-value stores. This limits how “real” a use case this feature can support, since even introductory Kaggle problems have datasets larger than 100MB.
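To make the constraint concrete, here is a rough sketch of the pattern the limit pushes you toward: keep working_dir down to code only and have the entrypoint pull large files from cloud storage itself (the bucket, key, and local path below are made-up placeholders).

import os

import boto3  # assumes the data lives in S3; any object store works

DATA_BUCKET = "my-training-data"   # hypothetical bucket
DATA_KEY = "datasets/train.csv"    # hypothetical object key
LOCAL_PATH = "/tmp/train.csv"


def fetch_dataset():
    # Download the dataset at job start instead of shipping it in working_dir.
    if not os.path.exists(LOCAL_PATH):
        boto3.client("s3").download_file(DATA_BUCKET, DATA_KEY, LOCAL_PATH)
    return LOCAL_PATH


if __name__ == "__main__":
    dataset_path = fetch_dataset()
    # ... training code reads dataset_path ...

This works, but it pushes data handling into every entrypoint, which is exactly the part that doesn’t feel clean.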

Minor limitations and suggested improvements

  • The autoscaler API returns a dashboard URL but doesn’t seem to also return a port, so I need to figure out the port by looking at the cluster.yaml. This may be solvable with something like a Jinja template, but it’s limiting.
  • No connection to placement groups: if I’m submitting more than one job, or if multiple people are using a cluster, I need to constrain the resource requirements for a job. In the torchX Ray scheduler this is done via a driver script, but it feels strange that it doesn’t naturally interoperate with existing Ray primitives. Ideally I just want a decorator or an extra parameter in the job submission API to constrain resources.
  • I may be wrong, but I don’t see any way of setting up a VPC in a private network, which limits enterprise adoption of this feature: https://gist.github.com/msaroufim/8623ead1b2dc75b09dbf2847330240e5
  • Not a Ray-specific concern, but when using my personal AWS account I needed to request an instance limit increase, which after 48 hours is still not resolved. I wonder if this process can be streamlined, assuming people put a reasonable credit card in the AWS console.
  • requirements.txt is not supported; you need to turn it into a list. I have the code to do this and can contribute it (a sketch is included after this list).
  • Support for multiple directories: in my case I had to create a temporary directory, copy everything into it, use that as the working directory, and then delete it when done.
  • Polling could just be a native function, since it’s almost impossible to imagine using the job scheduler without it (https://docs.ray.io/en/releases-1.9.0/ray-job-submission/overview.html#rest-api). Alternatively, it would be great to implement it as an async promise that gets fulfilled when a job succeeds or fails. See the polling sketch after this list.
  • The JobStatus enum could be expanded to include a queued state, which would be especially useful if I’m submitting several jobs or several people are submitting jobs to the same cluster: https://github.com/ray-project/ray/blob/master/dashboard/modules/job/common.py#L39
  • More of a request for ray up, but I wish there were an easy way to share the autoscaler SSH keys so multiple people can use the cluster I set up.
  • A more natural way to submit distributed jobs. Again, maybe torchX with a driver script is the answer, but being able to submit a job with environment variables as a baseline would make a lot of things simpler, since, for example, PyTorch distributed requires setting environment variables like rank and world size.
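To make the requirements.txt and polling points concrete, here is a minimal sketch of the helper I have in mind. It assumes a Ray version that exposes JobSubmissionClient and JobStatus under ray.job_submission; the dashboard address and entrypoint are placeholders.

import time

from ray.job_submission import JobSubmissionClient, JobStatus


def requirements_to_list(path="requirements.txt"):
    # Convert requirements.txt into the list form runtime_env currently expects.
    with open(path) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith("#")]


def submit_and_wait(entrypoint, address="http://127.0.0.1:8265", poll_period_s=5):
    client = JobSubmissionClient(address)
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": ".", "pip": requirements_to_list()},
    )
    # Poll until the job reaches a terminal state.
    terminal_states = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while True:
        status = client.get_job_status(job_id)
        if status in terminal_states:
            return status
        time.sleep(poll_period_s)


# Example usage: submit_and_wait("python train.py")

Having something equivalent built into the client, or an awaitable handle, would remove this boilerplate from every caller.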

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions
igorgad commented, Feb 4, 2022

I’m particularly excited about the ray job API as an opportunity to move from clusters dedicated to a specific project to a generic cluster that could handle multiple workloads across the entire team.

Allow me to add a couple of points:

  1. It should be easier to run certain scripts on specific nodes, like a non-Ray training script on a GPU node. I have been using the following workaround for this particular case, but IMO it should be integrated with the API.
import json
import os
import subprocess
import time

import ray

# Defined inline so the snippet is self-contained; the names mirror Ray's
# internal constants for the job config and cluster address env vars.
RAY_JOB_CONFIG_JSON_ENV_VAR = "RAY_JOB_CONFIG_JSON_ENV_VAR"
RAY_ADDRESS_ENVIRONMENT_VARIABLE = "RAY_ADDRESS"
SUBPROCESS_POLL_PERIOD_S = 1  # seconds between polls of the child process


@ray.remote
def remote_executor(command):
    # Forward the current runtime_env and cluster address so anything the
    # child launches attaches to the same Ray cluster.
    runtime_env = ray.get_runtime_context().runtime_env
    ray_address = ray._private.services.get_ray_address_to_use_or_die()
    os.environ[RAY_JOB_CONFIG_JSON_ENV_VAR] = json.dumps({"runtime_env": runtime_env})
    os.environ[RAY_ADDRESS_ENVIRONMENT_VARIABLE] = ray_address
    child_process = subprocess.Popen(command,
                                     shell=True,
                                     start_new_session=True)

    # Watchdog: if this worker process dies, kill the child's process group.
    parent_pid = os.getpid()
    child_pid = child_process.pid
    child_pgid = os.getpgid(child_pid)
    subprocess.Popen(
        f"while kill -s 0 {parent_pid}; do sleep 5; done; kill -9 -{child_pgid}",  # noqa: E501
        shell=True,
        # Suppress output
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

    # Poll until the child process exits, then return its exit code.
    while True:
        return_code = child_process.poll()
        if return_code is not None:
            return return_code
        time.sleep(SUBPROCESS_POLL_PERIOD_S)


# Example usage; num_cpus, num_gpus, gpu_type, conda_path, jobid, pg and
# entrypoint are defined by the caller.
executor = remote_executor.options(num_cpus=num_cpus,
                                   num_gpus=num_gpus,
                                   resources={gpu_type: num_gpus},
                                   runtime_env={'conda': conda_path},
                                   name=jobid,
                                   max_retries=0,
                                   placement_group=pg)
ray.get(executor.remote(entrypoint))
  2. Having something like ray job list that outputs a table would be cool for checking queued and running jobs (a rough sketch of what I mean is below). Cheers
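As a rough illustration of the table I have in mind, assuming a client-side list call like the list_jobs() that newer JobSubmissionClient versions expose (field names are approximate):

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
# Print a small status table for every submitted job.
for job in client.list_jobs():
    print(f"{job.submission_id:<30} {str(job.status):<12} {job.entrypoint}")
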
2 reactions
msaroufim commented, Feb 2, 2022

Thanks @edoakes

> On the point about working_dir size limitation, I don’t think we’re likely to increase this because it’s not intended to be used for very large files (e.g., datasets in Kaggle). The best practice for these is to store them in cloud storage (e.g., S3, HDFS) and load them in the application.

This actually sounds fine; it might just need to be mentioned in the docs or the source code for working_dir.

> What do you mean by “a more natural way to submit distributed jobs?” I’m not sure what you mean by “distributed” here – the job can be a normal Ray program that runs actors & tasks across the cluster, is there another type of “distributed job” you’re looking for?

Added more detail here: there are scripts, like in PyTorch distributed, where you need to set environment variables such as rank and world size, and I had to do this to make things work.
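For example, something along these lines would be a reasonable baseline. The sketch below assumes runtime_env accepts an env_vars field; the address, port, and ranks are placeholders.

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
client.submit_job(
    entrypoint="python train_ddp.py",  # placeholder torch.distributed script
    runtime_env={
        "working_dir": ".",
        "env_vars": {
            # torch.distributed bootstrap variables; values are placeholders
            "MASTER_ADDR": "10.0.0.1",
            "MASTER_PORT": "29500",
            "WORLD_SIZE": "2",
            "RANK": "0",
        },
    },
)

A driver script can still handle per-worker ranks, but being able to set a baseline environment at submission time would already simplify things.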
