ray.wait() doesn't return methods completed by dead actors as ready
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): source
- Ray version: 0.6.1
- Python version: 3.6.6
Describe the problem
- launch an actor on another node
- x = actor.ping.remote()
- kill the node containing the actor
- ray.wait([x], timeout=0)
x will never become ready, even if ray.wait is called much later.
Expected behavior is that x will become ready and store an exception.
This is an issue when adding heartbeats for actors on multiple nodes using ray.wait(), such as for distributed SGD.
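Until the object is reported as ready, a heartbeat built on non-blocking polling has to treat "not ready by a deadline" as a failure. The following is a minimal sketch of that idea, not part of Ray: heartbeat_ok, deadline_s and poll_s are hypothetical names, and wait_fn is assumed to behave like ray.wait when called as wait_fn([ref], timeout=0), returning a (ready, not_ready) pair. Passing the wait function in as an argument keeps the sketch runnable without a cluster.

```python
import time


def heartbeat_ok(wait_fn, ref, deadline_s=5.0, poll_s=0.5):
    """Return True if `ref` becomes ready within `deadline_s` seconds.

    `wait_fn` is assumed to behave like ray.wait: called as
    wait_fn([ref], timeout=0), it returns a (ready, not_ready) pair.
    The deadline policy is a hypothetical choice for this sketch.
    """
    deadline = time.monotonic() + deadline_s
    while True:
        # Non-blocking poll, as in the report above.
        ready, _ = wait_fn([ref], timeout=0)
        if ready:
            return True
        if time.monotonic() >= deadline:
            # The object never became ready: assume the actor is gone.
            return False
        time.sleep(poll_s)


# Intended use with Ray (assumption, not run here):
#   alive = heartbeat_ok(ray.wait, actor.ping.remote())
```

With a deadline in place, a dead actor shows up as a missed heartbeat rather than as a stored exception, so the caller still has to decide how to recover.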
Source code / logs
import time
import ray
from ray.test.cluster_utils import Cluster

cluster = Cluster(True, True, head_node_args={"num_cpus": 0})
node = cluster.add_node()

@ray.remote(num_cpus=1)
class Foo:
    def ping(self):
        pass

f = Foo.remote()
print("pinging")
ray.get(f.ping.remote())
x = f.ping.remote()
print("removing node")
cluster.remove_node(node)
print("done removing node")
for i in range(100):
    print(i, ray.wait([x], timeout=1))
    time.sleep(1)
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:15 (14 by maintainers)
Top GitHub Comments
Yes, this only happens with the timeout, which is non-blocking. If ray.wait blocks on the ObjectID, then the behavior is as expected.
Nice!
On Tue, Jul 23, 2019, 11:55 AM Stephanie Wang notifications@github.com wrote: