[Feature] Feedback on Ray Job API
Hi everyone,
I’ve been using the Ray Job API and `ray up` as part of my work building a Ray scheduler for torchX.
I’ll preface the points below by saying that I think the Ray Job API is a great feature, but it needs a bit more work to make it as great to use as the rest of Ray.
I have some feedback to improve its usability and one major limitation I want to point out, which @amogkam and @jiaodong suggested I share.
Major limitation
Right now it seems like `working_dir` has a 100 MB limit, which is problematic: it means I don’t have a clean story for loading training data, binaries, and extra files like label key-value stores. This limits how “real” a use case this feature can support, because even introductory Kaggle problems have datasets larger than 100 MB.
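For now the cleanest workaround I can think of is to keep the packaged `working_dir` small and have the entrypoint download large artifacts itself at startup. A minimal sketch, assuming the data lives in S3 (the bucket and key names are placeholders):

```python
# entrypoint.py -- the only file that needs to ship in working_dir
import os

import boto3  # assumes the cluster nodes have boto3 installed and AWS credentials available

DATA_DIR = "/tmp/job_data"


def fetch_dataset(bucket: str, key: str) -> str:
    """Download a large dataset at job start instead of packing it into working_dir."""
    os.makedirs(DATA_DIR, exist_ok=True)
    local_path = os.path.join(DATA_DIR, os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


if __name__ == "__main__":
    # placeholder bucket/key -- replace with wherever the real data lives
    dataset_path = fetch_dataset("my-training-data", "kaggle/some-dataset.csv")
    print(f"dataset available at {dataset_path}")
```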
Minor limitations and suggested improvements
- The autoscaler API returns a dashboard URL but doesn’t seem to also return a port, so I need to figure out the port by looking at the cluster.yaml. This may be solvable with something like a Jinja template, but it’s limiting.
- No connection to placement groups: if I’m submitting more than one job, or if multiple people are using a cluster, I need to constrain the resource requirements of each job. In the torchX Ray scheduler this is done via a driver script (a sketch of what I mean follows this list), but it feels strange that this doesn’t naturally interoperate with existing Ray primitives. Ideally I just want a decorator or an extra parameter in the job submission API to constrain resources.
- I may be wrong, but I don’t see any way of setting up a cluster in a VPC on a private network, which limits enterprise adoption of this feature: https://gist.github.com/msaroufim/8623ead1b2dc75b09dbf2847330240e5
- Not a Ray-specific concern, but when using my personal AWS account I needed to request an instance limit increase, which after 48 hours is still not resolved. I wonder if this process can be streamlined, assuming people put a reasonable credit card in the AWS console.
- `requirements.txt` is not supported, so I need to turn it into a list of packages instead. I have the code to do this and can contribute it; a sketch of it (together with the directory staging mentioned next) follows this list.
- Support for multiple directories: in my case I had to create a temporary directory, copy everything into it, use that as the working directory, and then delete it when done.
- Polling could just be a native function, since it’s almost impossible to imagine using the job scheduler without it (https://docs.ray.io/en/releases-1.9.0/ray-job-submission/overview.html#rest-api). Alternatively, it would be great to implement it as an async promise that gets fulfilled when a job succeeds or fails. The loop I end up hand-writing today is sketched after this list.
- The JobStatus enum could be expanded to include `queued`, which would be especially useful if I’m submitting several jobs or several people are submitting jobs to the same cluster: https://github.com/ray-project/ray/blob/master/dashboard/modules/job/common.py#L39
- More of a request for `ray up`, but I wish there was an easy way to share the autoscaler SSH keys so multiple people can use the cluster I set up.
- A more natural way to submit distributed jobs. Again, maybe torchX with a driver script is the answer, but being able to submit a job with environment variables as a baseline would make a lot of things simpler, since for example PyTorch distributed requires setting environment variables like rank and world size (see the env-var sketch after this list).
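On the placement-group point: the workaround I mean is a small driver script that reserves resources itself and then runs the actual work inside that reservation. Roughly something like this (the bundle sizes and the `train` task are placeholders):

```python
# driver.py -- submitted as the job entrypoint; reserves resources before doing any work
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init(address="auto")

# Reserve a fixed slice of the cluster so this job cannot starve other users' jobs.
pg = placement_group([{"CPU": 2}] * 4, strategy="PACK")
ray.get(pg.ready())


@ray.remote(num_cpus=2)
def train(shard: int) -> int:
    return shard  # placeholder for the real training work


# Ray 1.9-era syntax; newer releases express this with a placement-group scheduling strategy.
results = ray.get([train.options(placement_group=pg).remote(i) for i in range(4)])
print(results)

remove_placement_group(pg)
```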
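For the `requirements.txt` and multiple-directories points, the glue code I mean is small; something along these lines (the helper names are my own):

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def requirements_to_list(path: str) -> List[str]:
    """Turn a requirements.txt into the list of packages that runtime_env's pip field accepts."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]


def stage_dirs(dirs: List[str]) -> str:
    """Copy several directories into one temporary directory to use as working_dir."""
    staging = tempfile.mkdtemp(prefix="ray_job_")
    for d in dirs:
        shutil.copytree(d, Path(staging) / Path(d).name)
    return staging  # caller is responsible for shutil.rmtree() once the job is submitted
```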
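And this is the polling loop I keep rewriting by hand; it would be nice if something equivalent shipped with Ray. The client and enum names follow the current job submission SDK and the import path differs in older releases:

```python
import time

from ray.job_submission import JobSubmissionClient, JobStatus  # import path differs in Ray 1.9


def wait_for_job(client: JobSubmissionClient, job_id: str, poll_s: float = 5.0) -> JobStatus:
    """Block until the job reaches a terminal state and return that state."""
    terminal = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while True:
        status = client.get_job_status(job_id)
        if status in terminal:
            return status
        time.sleep(poll_s)


client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(entrypoint="python entrypoint.py")
print(wait_for_job(client, job_id))
```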
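For the distributed-job point, what I have in mind as a baseline is roughly the following; I believe recent versions expose this through runtime_env’s `env_vars` field (all values here are placeholders):

```python
from ray.job_submission import JobSubmissionClient  # import path differs in Ray 1.9

client = JobSubmissionClient("http://127.0.0.1:8265")

# PyTorch distributed reads these from the environment; addresses and sizes are placeholders.
job_id = client.submit_job(
    entrypoint="python train.py",
    runtime_env={
        "working_dir": "./",
        "env_vars": {
            "MASTER_ADDR": "10.0.0.1",
            "MASTER_PORT": "29500",
            "WORLD_SIZE": "2",
            "RANK": "0",
        },
    },
)
print(job_id)
```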
Top GitHub Comments
I’m particularly excited about the Ray Job API as an opportunity to move from clusters dedicated to a specific project to a generic cluster that could handle multiple workloads across the entire team.
Allow me to add a couple of points:
`ray job list` outputting a table would be cool for checking queued and running jobs.

Thanks @edoakes. This actually sounds fine; it might just need to be mentioned in the docs or source code for `working_dir`.
Added more detail here: there are scripts, like in PyTorch distributed training, that need environment variables such as rank and world size to be set. I had to do this to make things work.