Discussion on supporting long-running jobs
Current Approach
In our experience, most jobs in production environments are long running. Because this use case is so common, it is important to provide an appropriate and easy-to-use API for it.
I have noticed that @simon-mo implemented a relevant feature called detached actor in PR #6036. But there might be some issues with detached actor:
- It doesn’t work for normal tasks.
- The cost of rewriting a normal job into a long-running job is too high: users have to add the detached_actor flag for every actor creation. (Please correct me if I'm wrong.)
- Because users have to make a dummy call and get on the actor to make sure it is created, they need to know more about Ray's internals, which is not in line with the simplicity principle. (See the sketch after this list.)
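For context, the pattern being criticized looks roughly like the sketch below. It is written against today's named/detached actor API (Actor.options(name=..., lifetime="detached") plus a blocking ray.get), which may differ in detail from the API introduced in PR #6036; the ready() dummy method and the actor name are illustrative assumptions, not part of Ray.
import ray

ray.init(address="auto")

@ray.remote
class Worker:
    def ready(self):
        # Dummy method: calling it and blocking on the result is how users
        # confirm the actor has actually been created before the driver exits.
        return True

# Every actor that should outlive the driver must be created with the
# detached/named options explicitly.
worker = Worker.options(name="worker-0", lifetime="detached").remote()

# The dummy call + get that the second and third points above complain about.
ray.get(worker.ready.remote())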
Another Proposal
Based on the approach we have been using for a long time, we would like to support this in another way.
Add a clean_up flag to the ray.shutdown() method to indicate whether everything belonging to this job should be cleaned up.
# Nothing belonging to this job will be cleaned up, even if this driver exits immediately.
ray.shutdown(clean_up=False)
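As a minimal sketch (assuming the proposed clean_up flag, which does not exist in Ray today), a long-running driver would then look like this: ordinary actors and tasks, with a single change at shutdown time.
import ray

ray.init(address="auto")

@ray.remote
class Trainer:
    def train_forever(self):
        ...  # long-running work

# Plain actor creation -- no per-actor detached flag required.
trainer = Trainer.remote()
trainer.train_forever.remote()

# Proposed (hypothetical) flag: the driver exits, but nothing belonging to
# this job is cleaned up, so the actor keeps running on the cluster.
ray.shutdown(clean_up=False)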
Then there are two ways to drop the job from the cluster if we want to:
# execute a drop command
ray drop address="redis_address" job-id=xxxxxx
or drop it from another job with the ray.drop API:
ray.init(address="xxxx")
ray.drop(job_id=xxxxx)
P.S.: It would be more natural if we also supported a job-name for each job.
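For illustration only, with a job name the lifecycle could read like the sketch below; job_name and ray.drop are both hypothetical parts of this proposal, not existing Ray APIs.
import ray

# Submitting driver: name the job and leave it running (hypothetical APIs).
ray.init(address="xxxx", job_name="nightly-etl")
ray.shutdown(clean_up=False)

# Later, from an administrative driver, drop the job by name instead of
# by an opaque job id (hypothetical API).
ray.init(address="xxxx")
ray.drop(job_name="nightly-etl")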
If you think the ray.shutdown(clean_up=False) API is a bit weird, it might make more sense to put the flag on ray.init instead, like:
ray.init(long_running=True)
Any other proposal is welcome.
Top GitHub Comments
What I mean by the name is assigning a name to the job. How about this:
I think we should just align the job ids between “Ray jobs” and the “job server”. That way, kill will work as you’d expect in both cases.