Ray starts too many workers (and may crash) when using nested remote functions.
See the original GitHub issue. This is very similar to the earlier issue #231. One proposed solution was implemented by @stephanie-wang in https://github.com/ray-project/ray/pull/425.
Users sometimes encounter variants of the following bug and have no idea what is going wrong.
Running the following workload requires about 500 workers to be started (to execute all of the `g` tasks, which are blocked in the call to `ray.get`) before the `f` tasks start getting executed.
```python
import ray

ray.init()

@ray.remote
def f():
    return 1

@ray.remote
def g():
    return sum(ray.get([f.remote() for _ in range(10)]))

ray.get([g.remote() for _ in range(500)])
```
Workarounds:
- Start fewer `g` tasks.
- Divide `g` into two parts, e.g.,

  ```python
  @ray.remote
  def g_part_a():
      return [f.remote() for _ in range(10)]

  @ray.remote
  def g_part_b(*results):
      return sum(results)

  intermediate = ray.get([g_part_a.remote() for _ in range(500)])
  ray.get([g_part_b.remote(*ids) for ids in intermediate])
  ```

- Use custom resources to constrain the number of `g` tasks running concurrently (suggested by @ericl); a sketch follows this list.
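For the custom-resource workaround, a minimal sketch might look like the following. The resource name `g_slot` and the slot count of 4 are arbitrary choices for illustration, not values taken from the issue.

```python
import ray

# Declare a custom resource when starting Ray. Only 4 units of "g_slot"
# exist, so at most 4 tasks that each require one unit can run at a time.
ray.init(resources={"g_slot": 4})

@ray.remote
def f():
    return 1

# Each g task must hold one "g_slot" while it runs, so at most ~4 g tasks
# (and therefore at most ~4 blocked ray.get calls) exist at any moment,
# bounding the number of workers that get started.
@ray.remote(resources={"g_slot": 1})
def g():
    return sum(ray.get([f.remote() for _ in range(10)]))

print(ray.get([g.remote() for _ in range(500)]))
```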
Potential Solutions:
- Make the scheduler prioritize the `f` tasks over the `g` tasks (e.g., the strategy in #425 or some sort of LIFO policy).
In the meantime, we can easily detect that we’ve started way too many workers and push a warning to the user with a link to some possible workaround.
Issue Analytics
- Created: 5 years ago
- Comments: 19 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, is this issue improved in the latest Ray versions (e.g., 1.1.0)?
@nmayhall-vt there is something a little subtle happening here.
If you are doing something like
Then you should be able to avoid the issue by doing
The reason is that when you call `ray.get` inside of a remote function, Ray will treat the task as "not using any resources" until `ray.get` returns, and so will potentially schedule additional tasks, which may require additional workers to be created. Does that make sense?
Also, is it giving the warning and then crashing? Or is it giving the warning and then succeeding?
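
The code snippets from that comment did not survive the page scrape, but the pattern being described matches the "divide `g` into two parts" workaround above. The following is a hedged reconstruction of the two patterns, not the original code; the names `g_blocking`, `g_nonblocking`, and `reduce_sum` are placeholders.

```python
import ray

ray.init()

@ray.remote
def f():
    return 1

# Pattern that can start many workers: each g task blocks in ray.get, Ray
# treats the blocked task as "not using any resources", and keeps scheduling
# more g tasks, each of which needs its own worker process.
@ray.remote
def g_blocking():
    return sum(ray.get([f.remote() for _ in range(10)]))

# Alternative pattern: hand the ObjectRefs to a downstream task instead of
# blocking inside g. reduce_sum only runs once all of its inputs are ready,
# so no task sits on a worker waiting in ray.get.
@ray.remote
def reduce_sum(*values):
    return sum(values)

def g_nonblocking():
    refs = [f.remote() for _ in range(10)]
    return reduce_sum.remote(*refs)

results = ray.get([g_nonblocking() for _ in range(500)])
```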