Ray on a cluster: ConnectionError: Could not find any running Ray instance
I’m trying to test Ray on a university cluster with the code below:
    import ray
    ray.init(address="auto")
    import time

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray.services.get_node_ip_address()

    set(ray.get([f.remote() for _ in range(1000)]))
But it raises the error below. Am I using Ray incorrectly?
      File "<stdin>", line 2, in <module>
      File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/worker.py", line 643, in init
        address, redis_address)
      File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 273, in validate_redis_address
        address = find_redis_address_or_die()
      File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 165, in find_redis_address_or_die
        "Could not find any running Ray instance. "
    ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.
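For context, `address="auto"` only attaches to a Ray instance that is already running on the same machine (for example one started with `ray start --head`); it does not start one itself. The following is a minimal sketch, not a fix from the thread, of catching that error and falling back to a single-node instance for quick testing. The helper name `connect_or_start_local` is hypothetical, and `init_fn` is assumed to be `ray.init`.

```python
def connect_or_start_local(init_fn, address="auto"):
    """Attach to a running Ray instance, or start a local one if none is found.

    `init_fn` is expected to be `ray.init`; it is passed in as a parameter so
    the helper can be exercised without a cluster. The helper name is
    hypothetical and only used for illustration.
    """
    try:
        # With address="auto", Ray looks for an instance already running
        # on this machine and raises ConnectionError if there is none.
        return init_fn(address=address)
    except ConnectionError:
        # Fall back to a single-node instance started by this process.
        # This uses only the current node, not the whole cluster.
        return init_fn()
```

Usage would be `connect_or_start_local(ray.init)` after `import ray`. On a multi-node cluster the fallback is rarely what you want; there the head node has to be started before the driver script connects.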
Issue Analytics
- Created 3 years ago
- Comments:33 (5 by maintainers)
I managed to get Ray running on a PBS cluster using the following script:
with startWorkerNode.sh being:
Within Script.py, I have:
where the Redis password is retrieved through argparse.
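As an illustration of that last step, here is a minimal sketch of a driver script that reads the Redis password with argparse and passes it to `ray.init`. It is not the poster's Script.py: the flag names `--address` and `--redis-password` and the function names are assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the connection settings for the driver script."""
    parser = argparse.ArgumentParser(description="Connect to a running Ray cluster")
    # Both flag names are assumptions for this sketch.
    parser.add_argument("--address", default="auto",
                        help="head node address as host:port, or 'auto'")
    parser.add_argument("--redis-password", default=None,
                        help="password the head node was started with")
    return parser.parse_args(argv)


def connect(args):
    """Connect to the cluster described by the parsed arguments."""
    import ray  # imported here so argument parsing works without Ray installed

    kwargs = {"address": args.address}
    if args.redis_password is not None:
        # Ray 1.0+ spells this keyword `_redis_password`;
        # earlier releases used `redis_password`.
        kwargs["_redis_password"] = args.redis_password
    return ray.init(**kwargs)
```

The script would then end with something like `connect(parse_args())`, and the job script would pass the same password that was given to the head node.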
Hope that helps. 😃
Great news: here is the final solution, which works for Ray 1.0+.
For the PBS cluster, we have one .sub script for job submission and one shell script to start the worker nodes. The scripts are as follows.
The job.sub script:
The startWorkerNode.sh script:
Note that on the PBS cluster I’m using, before submitting the .sub file I need to go into the directory and run chmod on the .sh file.
I hope this is a general solution for everyone. I finally got it working with a lot of help from my university’s HPC specialist.
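For readers unfamiliar with the pattern, a job script of this kind usually starts a head node with `ray start --head` and then has each worker node join it with `ray start --address=...`. The sketch below, which is not the scripts from this thread, builds those two command lines in Python. The function names, the port, and the password are placeholders chosen for illustration; the `ray start` flags shown are the Ray 1.0+ spellings.

```python
import shlex


def head_command(port, redis_password):
    """Command line for starting the Ray head node (Ray 1.0+ flags)."""
    return ["ray", "start", "--head",
            f"--port={port}",
            f"--redis-password={redis_password}"]


def worker_command(head_address, redis_password):
    """Command line for a worker node joining the head at host:port."""
    return ["ray", "start",
            f"--address={head_address}",
            f"--redis-password={redis_password}",
            # --block keeps the process in the foreground so the
            # scheduler does not consider the worker finished.
            "--block"]


def as_shell(command):
    """Render a command list as a single shell-safe string."""
    return " ".join(shlex.quote(part) for part in command)


# Placeholder values, for illustration only.
print(as_shell(head_command(6379, "secret")))
print(as_shell(worker_command("10.0.0.1:6379", "secret")))
```

Each list can be passed to `subprocess.run(command, check=True)` on the relevant node, or the rendered string can be pasted into a shell script such as a worker start script.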