[core] `ray.wait` doesn't return ready objects until they are local

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.1dev

I believe ray.wait() is supposed to return objects once they are ready anywhere in the cluster. Right now, it seems that it only returns ready objects once they have been pulled to the local node.

At minimum, the docs should be updated to reflect this.

We should probably also add an option that returns objects as soon as they are ready anywhere in the cluster. Now that distributed ref counting is implemented, it seems like the buggy raylet-based ray.wait implementation could be removed completely, and the worker could just check whether the value is available in its local memory store.
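To make the two readiness rules concrete, here is a toy model in plain Python. It is only a sketch and does not use or reflect Ray's internals: `ToyObject`, `toy_wait`, and the node names are hypothetical, and the `fetch_local` flag is named loosely after the argument that `ray.wait` accepts in later Ray releases.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    # Hypothetical stand-in for an object ref: tracks which nodes hold a copy.
    name: str
    locations: set = field(default_factory=set)


def toy_wait(objects, local_node, fetch_local=True):
    """Split objects into (ready, not_ready) under the chosen readiness rule."""
    ready, not_ready = [], []
    for obj in objects:
        if fetch_local:
            # Behavior described in this issue: ready only once a copy is local.
            is_ready = local_node in obj.locations
        else:
            # Proposed option: ready as soon as any node holds a copy.
            is_ready = bool(obj.locations)
        (ready if is_ready else not_ready).append(obj)
    return ready, not_ready


# The object exists on node2 only; the caller runs on node1.
remote_obj = ToyObject("remote_ref", {"node2"})
print(toy_wait([remote_obj], "node1", fetch_local=True))
print(toy_wait([remote_obj], "node1", fetch_local=False))
```

Under the first rule the object stays in the not-ready list until it is copied to `node1`, which is the situation the issue describes. Under the second rule it counts as ready immediately.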

Reproduction (REQUIRED)

This script creates a remote object, then a local object which takes up all the memory in the local node’s object store. This causes ray.wait on the remote object to hang even though the remote object has been created, because the remote object cannot be fetched to the local node.

import numpy as np

import ray
from ray.cluster_utils import Cluster
import pytest


cluster = Cluster()
# Force task onto the second node.
cluster.add_node(num_cpus=0, object_store_memory=75 * 1024 * 1024)
cluster.add_node(object_store_memory=75 * 1024 * 1024)

ray.init(cluster.address)

@ray.remote
def put():
    return np.random.rand(5 * 1024 * 1024)  # 40 MB data

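# Fill most of the local node's 75 MB object store with a 40 MB array.
local_ref = ray.put(np.random.rand(5 * 1024 * 1024))
print("local", local_ref)
# Create a second 40 MB array on the other node.
remote_ref = put.remote()
print("remote", remote_ref)

# Hangs here: the remote object exists but cannot be fetched to the local node.
ray.wait([remote_ref], num_returns=1)
print("----")
# While local_ref is alive, fetching the remote object should time out.
with pytest.raises(ray.exceptions.GetTimeoutError):
    ray.get(remote_ref, timeout=1)
print("----")
# Free the local object so the remote one can be pulled.
del local_ref
ray.get(remote_ref)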
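The sizes in the script above can be checked with a little arithmetic. The sketch below assumes 8-byte float64 elements (the dtype that `np.random.rand` returns) and ignores any serialization or metadata overhead in the object store, so the real footprint is somewhat larger. The helper name `array_mib` is made up for illustration.

```python
FLOAT64_BYTES = 8          # np.random.rand produces float64 values
MIB = 1024 * 1024


def array_mib(num_elements, itemsize=FLOAT64_BYTES):
    """Raw payload size of an array in MiB, ignoring store overhead."""
    return num_elements * itemsize / MIB


store_mib = 75 * 1024 * 1024 / MIB      # object_store_memory per node
one_array = array_mib(5 * 1024 * 1024)  # size of each array in the script

fits_one = one_array <= store_mib
fits_two = 2 * one_array <= store_mib

print(f"store: {store_mib} MiB, one array: {one_array} MiB")
print(f"one array fits: {fits_one}, two arrays fit: {fits_two}")
```

One array fits in a node's store but two do not, which is why the remote object cannot be pulled while `local_ref` is still alive.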

I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.

Top GitHub Comments

simon-mo commented, Dec 2, 2020

@stephanie-wang are you planning to take this? @alindkhare might be interested in working on this if you leave some pointers on how to implement it.

cc @atumanov

iycheng commented, Dec 4, 2020

Sorry for the delay. I just added the test to test_basics.py, but I noticed that I can’t add reviewers to this PR. Could you please review it? @ericl @stephanie-wang
