`ray.get` in cluster mode sometimes does not return
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
- Ray installed from (source or binary): wheels
- Ray version: 0.8.0.dev3
- Python version: 3.6
- Exact command to reproduce: With 2 nodes:
```python
In [1]: import ray

In [2]: ray.init(redis_address="localhost:6379")
2019-08-01 02:57:18,898 WARNING worker.py:1372 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Out[2]:
{'node_ip_address': '172.31.95.217',
 'redis_address': '172.31.95.217:6379',
 'object_store_address': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867'}

In [3]: @ray.remote
   ...: def test():
   ...:     print("hello!")
   ...:     return 123

In [4]: ray.get(test.remote())
(pid=2896) hello!
Out[4]: 123

In [5]: ray.get(test.remote())
(pid=2833, ip=172.31.89.59) hello!
```
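When the hang strikes, a call like `In [5]` above blocks indefinitely. One way to detect a wedged task instead of blocking forever is to poll with a timeout before fetching the result. Exercising real `ray` calls needs a live cluster, so here is the same guard pattern sketched with stdlib futures; the `get_with_timeout` helper is hypothetical, not a Ray API:

```python
import concurrent.futures

def get_with_timeout(future, timeout):
    """Return the result if it arrives within `timeout` seconds, else raise."""
    done, not_done = concurrent.futures.wait([future], timeout=timeout)
    if not_done:
        raise TimeoutError("task did not finish; the cluster may be wedged")
    return future.result()

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Stand-in for test.remote(): a task that returns 123 promptly.
    f = pool.submit(lambda: 123)
    print(get_with_timeout(f, timeout=5))  # → 123
```

In Ray itself the analogous guard is `ray.wait`, which accepts a timeout and returns the subset of object IDs that are ready, so a stuck task shows up as an empty ready list rather than an indefinite block.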
Sometimes, `ray.get` does not return. The cluster was launched with the following config:
```yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: sgd-pytorch

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
min_workers: 1
initial_workers: 1
max_workers: 1
target_utilization_fraction: 0.9

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 20

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1f

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.xlarge
    ImageId: ami-0d96d570269578cd7

worker_nodes:
    InstanceType: c5.xlarge
    ImageId: ami-0d96d570269578cd7

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl

file_mounts: {}

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --object-store-memory=1000000000

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --object-store-memory=1000000000
```
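For reference, a config like the one above is typically driven with the Ray cluster launcher CLI; a sketch, assuming it is saved as `sgd-pytorch.yaml`:

```shell
# Assumes the YAML above is saved as sgd-pytorch.yaml
ray up sgd-pytorch.yaml      # launch the head node (workers follow per min/max_workers)
ray attach sgd-pytorch.yaml  # open an SSH session on the head node
ray down sgd-pytorch.yaml    # tear the cluster down when finished
```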
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I traced down the real issue. This bug only occurs after https://github.com/ray-project/ray/pull/5120. However, the real issue seems to be that somehow `ray stop` is unable to stop the `default_worker` and `raylet` processes. I'm suspecting that `kill` failed for some reason, or that the gRPC thread doesn't respond to signals correctly. Checking…
Steps to reproduce the ray stop bug
https://gist.github.com/simon-mo/7194824e161f336d699d9a0bcb65c13e
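The hypothesis that the kill fails because a process masks or ignores the signal can be checked in isolation. A minimal stdlib sketch (not Ray code, POSIX only): a child that ignores SIGTERM survives the polite kill that a `ray stop`-style shutdown typically sends, and only dies on SIGKILL, which cannot be caught or ignored:

```python
import subprocess
import sys
import time

# Child that ignores SIGTERM, standing in for a worker whose signal
# handling is broken (e.g. a gRPC thread swallowing the signal).
child = subprocess.Popen([
    sys.executable, "-c",
    "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); time.sleep(60)",
], stdout=subprocess.PIPE)
child.stdout.readline()          # wait until the handler is installed

child.terminate()                # SIGTERM: the polite kill
time.sleep(0.5)
print("survived SIGTERM:", child.poll() is None)   # → survived SIGTERM: True

child.kill()                     # SIGKILL cannot be ignored
child.wait(timeout=5)
print("gone after SIGKILL:", child.poll() is not None)  # → gone after SIGKILL: True
```

If Ray workers behave like this child, a forceful SIGKILL fallback (or fixing the signal handling in the worker) would be needed for `ray stop` to reliably clear the old processes.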