Discussion on supporting long-running jobs
Current Approach
In our experience, most jobs in production environments are long running. Because this use case is so common, it is important to provide an appropriate and easy-to-use API for it.
I have noticed that @simon-mo implemented a relevant feature called detached actor in PR #6036. But there might be some issues with detached actor:
- It doesn’t work for normal tasks.
- The cost of rewriting a normal job into a long-running job is too high: users have to add the detached_actor flag for every actor creation. (Please correct me if I'm wrong.)
- Because users have to make a dummy call and get on the actor to make sure it is created, they need to know more about Ray's internals, which is not in line with the simplicity principle. (See the sketch after this list.)
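For context, the pattern being criticized looks roughly like the sketch below. It is written against today's named/detached actor API (Actor.options(name=..., lifetime="detached") plus a blocking ray.get), which may differ in detail from the API introduced in PR #6036; the ready() dummy method and the actor name are illustrative assumptions, not part of Ray.
import ray

ray.init(address="auto")

@ray.remote
class Worker:
    def ready(self):
        # Dummy method: calling it and blocking on the result is how users
        # confirm the actor has actually been created before the driver exits.
        return True

# Every actor that should outlive the driver must be created with the
# detached/named options explicitly.
worker = Worker.options(name="worker-0", lifetime="detached").remote()

# The dummy call + get that the second and third points above complain about.
ray.get(worker.ready.remote())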
Another Proposal
Based on the approach we have been using for a long time, we would like to support this in another way.
Add a clean_up flag to the ray.shutdown() method to indicate whether everything belonging to this job should be cleaned up.
# Nothing belonging to this job will be cleaned up, even if this driver exits immediately.
ray.shutdown(clean_up=False)
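As a minimal sketch (assuming the proposed clean_up flag, which does not exist in Ray today), a long-running driver would then look like this: ordinary actors and tasks, with a single change at shutdown time.
import ray

ray.init(address="auto")

@ray.remote
class Trainer:
    def train_forever(self):
        ...  # long-running work

# Plain actor creation -- no per-actor detached flag required.
trainer = Trainer.remote()
trainer.train_forever.remote()

# Proposed (hypothetical) flag: the driver exits, but nothing belonging to
# this job is cleaned up, so the actor keeps running on the cluster.
ray.shutdown(clean_up=False)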
Then there are two ways to drop the job from the cluster if we want to:
# execute a drop command
ray drop address="redis_address" job-id=xxxxxx
or drop it from another job with the ray.drop API:
ray.init(address="xxxx")
ray.drop(job_id=xxxxx)
P.S.: It would be more natural if we also supported a job-name for each job.
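For illustration only, with a job name the lifecycle could read like the sketch below; job_name and ray.drop are both hypothetical parts of this proposal, not existing Ray APIs.
import ray

# Submitting driver: name the job and leave it running (hypothetical APIs).
ray.init(address="xxxx", job_name="nightly-etl")
ray.shutdown(clean_up=False)

# Later, from an administrative driver, drop the job by name instead of
# by an opaque job id (hypothetical API).
ray.init(address="xxxx")
ray.drop(job_name="nightly-etl")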
If you think the ray.shutdown(clean_up=False) API is a bit weird, it might make more sense to put the flag on ray.init instead, like:
ray.init(long_running=True)
Any other proposal is welcome.
Top GitHub Comments
What I mean by the name is assigning a name to the job. How about this:
I think we should just align the job ids between “Ray jobs” and the “job server”. That way, kill will work as you’d expect in both cases.