[ray] Clustering issue
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: 0.6.4
- Python version: 3.6.8
- Exact command to reproduce:
Describe the problem
I tried the “manual cluster setup” on GCP instances, but it always fails.
I ran ray start --head --redis-port=6379 on the head machine, and then import ray followed by ray.init(redis_address="10.129.0.7:6379") on the node machine.
The logs are attached below; they show an exception about raylets.
I also tested Ray versions 0.6.3 and 0.7.0 and got the same result. The machines can reach each other and Redis without any communication problem, and all ports are open.
So why can't the cluster be set up?
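One way to double-check that the ports really are reachable is to attempt a TCP connection from the node machine to the head. The following is a minimal sketch, not part of Ray: check_ports is a hypothetical helper, and the port list is an assumption (only 6379 appears in the commands above; other Ray processes may listen on ports chosen at startup unless they are fixed explicitly).

```python
import socket

# Assumption: only the Redis port from the commands above is known up front.
# Add any other ports that were fixed when Ray was started.
PORTS_TO_CHECK = [6379]


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool} telling whether a TCP connect to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection completes a full TCP handshake or raises.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            results[port] = False
    return results
```

Calling check_ports("10.129.0.7", PORTS_TO_CHECK) on the node machine would report which of the listed ports accept connections. A successful connect only shows that something is listening there, not that it is the expected Ray process.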
Source code / logs
log of head
2019-03-18 01:19:44,763 INFO scripts.py:286 -- Using IP address 10.129.0.7 for this node.
2019-03-18 01:19:44,763 INFO node.py:439 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-03-18_01-19-44_3587/logs.
2019-03-18 01:19:44,866 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-03-18 01:19:44,975 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:32675 to respond...
2019-03-18 01:19:44,976 INFO services.py:761 -- Starting Redis shard with 6.32 GB max memory.
2019-03-18 01:19:44,984 INFO services.py:1449 -- Starting the Plasma object store with 9.48 GB memory using /dev/shm.
2019-03-18 01:19:44,991 INFO scripts.py:317 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --redis-address 10.129.0.7:6379
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(redis_address="10.129.0.7:6379")
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
log of node 1
>>> import ray
>>> ray.init(redis_address="10.129.0.7:6379")
2019-03-18 01:21:15,265 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:16,267 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:17,271 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:18,274 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2019-03-18 01:21:19,276 WARNING worker.py:1249 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1499, in init
    redis_address, node_ip_address, redis_password=redis_password)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1242, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/home/jason.park/.pyenv/versions/3.6.8/lib/python3.6/site-packages/ray/worker.py", line 1222, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
Exception: Redis has started but no raylets have registered yet.
Issue Analytics
- State:
- Created 5 years ago
- Comments:15 (1 by maintainers)
Top GitHub Comments
I had the same problem when manually setting up the cluster. For me, the problem was that I had not opened enough ports for Ray. According to this comment, multiple ports need to be open.
I solved this problem by opening ports 6379, 6380, 12345 and 12346 on all nodes.
On the head node:
On the other nodes:
Now I can connect a driver to the cluster on both head node and the other nodes:
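The exact commands from this comment are not preserved here. As a rough sketch of what pinning those four ports might look like, the helpers below build ray start command lines. They assume the ray start flags of the Ray 0.6/0.7 era (--redis-port, --redis-shard-ports, --node-manager-port, --object-manager-port, --redis-address); the function names and the mapping of each port number to a flag are illustrative assumptions, not taken from the thread.

```python
def head_command(redis_port=6379, shard_port=6380,
                 node_manager_port=12345, object_manager_port=12346):
    """Build a `ray start --head` argv with every port fixed (assumed flags)."""
    return [
        "ray", "start", "--head",
        "--redis-port={}".format(redis_port),
        "--redis-shard-ports={}".format(shard_port),
        "--node-manager-port={}".format(node_manager_port),
        "--object-manager-port={}".format(object_manager_port),
    ]


def worker_command(head_ip, redis_port=6379,
                   node_manager_port=12345, object_manager_port=12346):
    """Build the `ray start` argv for a non-head node joining the cluster."""
    return [
        "ray", "start",
        "--redis-address={}:{}".format(head_ip, redis_port),
        "--node-manager-port={}".format(node_manager_port),
        "--object-manager-port={}".format(object_manager_port),
    ]
```

Each list could be passed to subprocess.check_call on the corresponding machine. The warning in the node log above asks whether 'ray start' has been run on that node, so the worker command would presumably need to run there before ray.init(redis_address="10.129.0.7:6379") is called.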
So you call ‘ray start’ from the head node itself?
I am having trouble connecting to my cluster via Python from my local machine. I am trying to (1) start the cluster from my local machine with ray up or ray start, which succeeds, and then (2) call ray.init(redis_address="<ip>:<port>").
I am confident the cluster starts because I am able to run the python script with ray submit config.yaml script.py, which I understand copies the python script to the head node. However, I imagine it is possible to connect to your cluster from your local machine and make remote cluster calls?
Has anyone else experienced this? Could the above responders kindly provide some more specifics on where they are starting the cluster from, where they are running their python scripts from, etc?