
Discuss on supporting long running job

See original GitHub issue

Current Approach

In our experience, most production jobs are long-running. Because this is such a common case, it is important to provide an appropriate and easy-to-use API for it.

I have noticed that @simon-mo implemented a related feature, detached actors, in PR #6036. But there may be some issues with detached actors (a sketch of the pattern follows the list below):

  1. It doesn’t work for normal tasks.
  2. The cost of rewriting a normal job into a long-running job is too high: users have to add the detached-actor flag to every actor creation. (Please correct me if I’m wrong.)
  3. Because users must make a dummy call and get on the actor to make sure it has been created, they have to know more about Ray’s internals than they should. This is not in line with the principle of simplicity.
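
For reference, here is a minimal sketch of that pattern as it looks in current Ray releases; the lifetime="detached" option and ray.get_actor come from later Ray versions rather than PR #6036 itself, so treat it as illustrative only:

import ray

ray.init(address="auto")

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# Detached actors are created with a name and lifetime="detached".
counter = Counter.options(name="counter", lifetime="detached").remote()

# The "dummy call and get" from point 3: block until the actor really exists
# before the driver exits.
ray.get(counter.increment.remote())

# A later driver can reconnect to the same actor by name:
# counter = ray.get_actor("counter")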

Another Proposal

Based on an approach we have been using internally for a long time, we’d like to support this differently.

Add a clean_up flag to the ray.shutdown() method to indicate whether everything belonging to this job should be cleaned up.

# Do not clean up this job’s resources, even if the driver exits immediately.
ray.shutdown(clean_up=False)

Then there are two ways to drop the job from the cluster when we want to:

# execute a drop command
ray drop address="redis_address" job-id=xxxxxx

or drop it from another job with the ray.drop API:

ray.init(address="xxxx")
ray.drop(job_id=xxxxx)
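
Taken together, the workflow might look like the following two-driver sketch. To be clear, clean_up, ray.drop, and the ray drop command are hypothetical names from this proposal; none of them exist in Ray today:

# driver_1.py: submit long-running work, then detach (hypothetical API).
import ray

ray.init(address="auto")

@ray.remote
def serve_forever():
    ...  # e.g. a streaming or serving loop

serve_forever.remote()

# Hypothetical flag: leave this job's tasks and actors running after the driver exits.
ray.shutdown(clean_up=False)

# driver_2.py: later, tear the job down from another driver (hypothetical API).
import ray

ray.init(address="auto")
ray.drop(job_id="xxxxxx")  # hypothetical; the job id would be recorded when driver_1 ran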

P.S. It would be more natural if we also supported a name (job-name) for each job.

If the API ray.shutdown(clean_up=False) seems a bit odd, it may make more sense to put the flag on ray.init instead:

ray.init(long_running=True)

Any other proposal is welcome.

cc @simon-mo @stephanie-wang @robertnishihara

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 36 (33 by maintainers)

Top GitHub Comments

4 reactions
ericl commented on Jan 7, 2020

By “name” I mean assigning a name to the job. How about this:

ray.init(address="..", job_name="my job name")
ray.shutdown(detach=True)
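
A speculative follow-up to that shape of API: if jobs carry names, the drop/kill command proposed earlier could address them by name instead of by id. Everything below extends the snippet above and is hypothetical:

# Name the job at init and detach it at shutdown (hypothetical API).
ray.init(address="..", job_name="nightly-stream")
ray.shutdown(detach=True)

# Later, kill it by name from the command line (also hypothetical):
# ray drop address="redis_address" job-name=nightly-stream
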
1 reaction
ericl commented on Dec 20, 2021

I think we should just align the job IDs between “Ray Jobs” and the job server. That way, kill will work as you’d expect in both cases.
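
For context, the Ray Jobs CLI that exists in current Ray releases already works roughly this way: submitting a job returns a submission id, which can then be used to stop the job. The script name and the id below are placeholders:

# Submit a script as a job; Ray prints a submission id (e.g. raysubmit_XXXX).
ray job submit --address http://127.0.0.1:8265 -- python long_running_script.py

# Stop (kill) the job by that submission id.
ray job stop raysubmit_XXXX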


