Ray starts too many workers (and may crash) when using nested remote functions.
See the original GitHub issue. This is very similar to the earlier issue #231. One proposed solution was implemented by @stephanie-wang in https://github.com/ray-project/ray/pull/425.
Users sometimes encounter variants of the following bug and have no idea what is going wrong.
Running the following workload requires about 500 workers to be started (to execute all of the `g` tasks, which are blocked in the call to `ray.get`) before the `f` tasks start getting executed.
```python
import ray

ray.init()

@ray.remote
def f():
    return 1

@ray.remote
def g():
    return sum(ray.get([f.remote() for _ in range(10)]))

ray.get([g.remote() for _ in range(500)])
```
Workarounds:
- Start fewer `g` tasks.
- Divide `g` into two parts, e.g.,

  ```python
  @ray.remote
  def g_part_a():
      return [f.remote() for _ in range(10)]

  @ray.remote
  def g_part_b(*results):
      return sum(results)

  intermediate = ray.get([g_part_a.remote() for _ in range(500)])
  ray.get([g_part_b.remote(*ids) for ids in intermediate])
  ```

- Use custom resources to constrain the number of `g` tasks running concurrently (suggested by @ericl); a sketch follows this list.
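For the custom-resource workaround, a minimal sketch might look like the following. The resource name `g_slot` and the slot count of 4 are arbitrary choices for illustration, not values taken from the issue.

```python
import ray

# Declare a custom resource when starting Ray. Only 4 units of "g_slot"
# exist, so at most 4 tasks that each require one unit can run at a time.
ray.init(resources={"g_slot": 4})

@ray.remote
def f():
    return 1

# Each g task must hold one "g_slot" while it runs, so at most ~4 g tasks
# (and therefore at most ~4 blocked ray.get calls) exist at any moment,
# bounding the number of workers that get started.
@ray.remote(resources={"g_slot": 1})
def g():
    return sum(ray.get([f.remote() for _ in range(10)]))

print(ray.get([g.remote() for _ in range(500)]))
```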
Potential Solutions:
- Make the scheduler prioritize the `f` tasks over the `g` tasks (e.g., the strategy in #425 or some sort of LIFO policy).
In the meantime, we can easily detect that we’ve started way too many workers and push a warning to the user with a link to some possible workaround.
Issue Analytics
- Created: 5 years ago
- Comments: 19 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, is this issue improved in the latest Ray versions (e.g., 1.1.0)?
@nmayhall-vt there is something a little subtle happening here.
If you are doing something like
Then you should be able to avoid the issue by doing
The reason is that when you call `ray.get` inside of a remote function, Ray will treat the task as "not using any resources" until `ray.get` returns, and so will potentially schedule additional tasks, which may require additional workers to be created. Does that make sense?
Also, is it giving the warning and then crashing? Or is it giving the warning and then succeeding?
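
The code snippets from that comment did not survive the page scrape, but the pattern being described matches the "divide `g` into two parts" workaround above. The following is a hedged reconstruction of the two patterns, not the original code; the names `g_blocking`, `g_nonblocking`, and `reduce_sum` are placeholders.

```python
import ray

ray.init()

@ray.remote
def f():
    return 1

# Pattern that can start many workers: each g task blocks in ray.get, Ray
# treats the blocked task as "not using any resources", and keeps scheduling
# more g tasks, each of which needs its own worker process.
@ray.remote
def g_blocking():
    return sum(ray.get([f.remote() for _ in range(10)]))

# Alternative pattern: hand the ObjectRefs to a downstream task instead of
# blocking inside g. reduce_sum only runs once all of its inputs are ready,
# so no task sits on a worker waiting in ray.get.
@ray.remote
def reduce_sum(*values):
    return sum(values)

def g_nonblocking():
    refs = [f.remote() for _ in range(10)]
    return reduce_sum.remote(*refs)

results = ray.get([g_nonblocking() for _ in range(500)])
```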