
Ray starts too many workers (and may crash) when using nested remote functions.

See original GitHub issue

This is very similar to the earlier issue #231. One proposed solution was implemented by @stephanie-wang in https://github.com/ray-project/ray/pull/425.

Users sometimes encounter variants of the following bug and have no idea what is going wrong.

Running the following workload requires about 500 workers to be started (to execute all of the g tasks, which are blocked in the call to ray.get, before the f tasks start getting executed).

import ray
ray.init()

@ray.remote
def f():
    return 1

@ray.remote
def g():
    # Each g task blocks in ray.get until all 10 of its f tasks finish.
    return sum(ray.get([f.remote() for _ in range(10)]))

# Launching 500 g tasks therefore requires roughly 500 workers.
ray.get([g.remote() for _ in range(500)])

Workarounds:

  • Start fewer g tasks
  • Divide g into two parts, e.g.,
    @ray.remote
    def g_part_a():
        # Launch the f tasks and return their ObjectIDs without blocking on them.
        return [f.remote() for _ in range(10)]
    
    @ray.remote
    def g_part_b(*results):
        # Ray resolves the ObjectIDs passed as arguments before this task starts.
        return sum(results)
    
    intermediate = ray.get([g_part_a.remote() for _ in range(500)])
    ray.get([g_part_b.remote(*ids) for ids in intermediate])
    
  • Use custom resources to constrain the number of g tasks running concurrently (suggested by @ericl); see the sketch after this list.
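
A minimal sketch of the custom-resource workaround (the resource name "g_slot" and the count of 8 are arbitrary illustrative choices, not part of the original report):

import ray

# Advertise 8 units of a made-up resource on the local node.
ray.init(resources={"g_slot": 8})

@ray.remote
def f():
    return 1

# Each g task must hold one "g_slot", so at most 8 of them run at once,
# which bounds the number of workers blocked in ray.get.
@ray.remote(resources={"g_slot": 1})
def g():
    return sum(ray.get([f.remote() for _ in range(10)]))

ray.get([g.remote() for _ in range(500)])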

Potential Solutions:

  • Make the scheduler prioritize the f tasks over the g tasks (e.g., the strategy in #425 or some sort of LIFO policy).

In the meantime, we can easily detect that we’ve started way too many workers and push a warning to the user with a link to some possible workaround.

cc @stephanie-wang @ericl

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 19 (2 by maintainers)

Top GitHub Comments

6 reactions
Lewisracing commented, Nov 22, 2020

Hi, is this issue improved in the latest Ray versions (e.g. 1.1.0)?

4 reactions
robertnishihara commented, May 2, 2020

@nmayhall-vt there is something a little subtle happening here.

If you are doing something like

x_id = ray.put(x)
y_id = ray.put(y)

@ray.remote
def process(data):
    # Calling ray.get inside the task is what can trigger extra workers.
    x = ray.get(data[0])
    y = ray.get(data[1])

    # Use x and y

process.remote([x_id, y_id])

Then you should be able to avoid the issue by doing

x_id = ray.put(x)
y_id = ray.put(y)

@ray.remote
def process(x, y):
    # Use x and y; the ObjectIDs are resolved before the task starts.
    ...

process.remote(x_id, y_id)

The reason is that when you call ray.get inside of a remote function, Ray treats the task as “not using any resources” until ray.get returns, and so it may schedule additional tasks, which may require additional workers to be created.

Does that make sense?

Also, is it giving the warning and then crashing? Or is it giving the warning and then succeeding?
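
For reference, a self-contained version of the second pattern above might look like this (the NumPy arrays are placeholder data for illustration):

import numpy as np
import ray

ray.init()

x = np.ones(10)
y = np.ones(10)

x_id = ray.put(x)
y_id = ray.put(y)

@ray.remote
def process(x, y):
    # The ObjectIDs are resolved before the task starts, so no ray.get is
    # needed inside the function.
    return (x + y).sum()

print(ray.get(process.remote(x_id, y_id)))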
