
[core] `ray.wait` doesn't return ready objects until they are local

See original GitHub issue

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.1dev

I believe ray.wait() is supposed to return objects once they are ready anywhere in the cluster. Right now, it seems that it only returns ready objects once they have been pulled to the local node.

At minimum, the docs should be updated to reflect this.
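For reference, the documented contract looks roughly like the sketch below (a minimal, self-contained example written for this report, not taken from the issue):

import numpy as np
import ray

ray.init()

@ray.remote
def make_array():
    return np.zeros(1024)

refs = [make_array.remote() for _ in range(2)]
# Documented contract: split refs into ready / not-ready as soon as the
# underlying objects exist anywhere in the cluster.
ready, not_ready = ray.wait(refs, num_returns=1, timeout=10)
# Behavior described in this issue: a ref only shows up in `ready` after the
# object has been pulled into the local node's object store.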

Probably, we should also add an option to support returning ready objects once they are ready anywhere. Now that distributed ref counting is implemented, it seems like the buggy raylet-based ray.wait implementation could actually be removed completely and the worker can just check if the value is available in its local memory store.
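As a sketch of what that option could look like from the caller's side (the fetch_local keyword name is an assumption here, echoing the related results listed at the end of this page; it was not part of the API when this issue was filed):

import numpy as np
import ray

ray.init()

@ray.remote
def produce():
    return np.random.rand(1024)

ref = produce.remote()

# Hypothetical flag: report the ref as ready once the object exists anywhere
# in the cluster, without first pulling it into the local object store.
ready, not_ready = ray.wait([ref], num_returns=1, fetch_local=False)

# Only ray.get would then trigger the transfer to the local node.
value = ray.get(ready[0])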

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

This script creates a remote object, then a local object which takes up all the memory in the local node’s object store. This causes ray.wait on the remote object to hang even though the remote object has been created, because the remote object cannot be fetched to the local node.

import numpy as np
import pytest

import ray
from ray.cluster_utils import Cluster


# Two-node cluster; each node gets a ~75 MB object store.
cluster = Cluster()
# The head node has no CPUs, which forces the task onto the second node.
cluster.add_node(num_cpus=0, object_store_memory=75 * 1024 * 1024)
cluster.add_node(object_store_memory=75 * 1024 * 1024)

ray.init(address=cluster.address)

@ray.remote
def put():
    return np.random.rand(5 * 1024 * 1024)  # ~40 MB of data

# Fill most of the local object store so the remote object cannot be pulled.
local_ref = ray.put(np.random.rand(5 * 1024 * 1024))
print("local", local_ref)
remote_ref = put.remote()
print("remote", remote_ref)

# Hangs: the remote object is ready on the second node, but ray.wait does not
# return it until it can be fetched into the local object store.
ray.wait([remote_ref], num_returns=1)
print("----")
with pytest.raises(ray.exceptions.GetTimeoutError):
    ray.get(remote_ref, timeout=1)
print("----")
# Freeing the local object makes room for the fetch, so this ray.get succeeds.
del local_ref
ray.get(remote_ref)

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
simon-mo commented, Dec 2, 2020

@stephanie-wang are you planning to take this? @alindkhare might be interested in working on this if you leave some pointers on how to implement it.

cc @atumanov

0 reactions
iycheng commented, Dec 4, 2020

Sorry for the delay. I just added the test to test_basics.py. But somehow I can’t add reviewers to this PR, so please help review it. @ericl @stephanie-wang

Read more comments on GitHub >

Top Results From Across the Web

ray.wait(fetch_local=False) in asyncio
It takes requests, returns a reference immediately, and then forwards the request to an actor at some unknown time in the future. …

Programming in Ray: Tips for first-time users - RISE Lab
By default it returns one ready object ID at a time: ready_ids, not_ready_ids = ray.wait(object_ids). Table 1: The core Ray API …

Ray Tips and Tricks, Part I — ray.wait | by Dean Wampler
By default it returns one ready object ID at a time (doc); ray.shutdown() disconnects the worker and terminates processes started by ray.init …

Ray Tutorial | A Quest After Perspectives
The second task will not be executed until the first task has … So ray.wait will block until either num_returns objects are ready …

Ray object store running out of memory using out of core. How …
As of Ray 1.2.0, object spilling to support out-of-core data processing is supported. From 1.3+ (which will be released in 3 weeks), …
