Ray on a cluster: ConnectionError: Could not find any running Ray instance

I’m trying to test Ray on a university cluster with the code below:

import ray
ray.init(address="auto")
import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))

But it fails with the error below. Am I using Ray incorrectly?

File "<stdin>", line 2, in <module>
File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/worker.py", line 643, in init
    address, redis_address)
File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 273, in validate_redis_address
    address = find_redis_address_or_die()
File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 165, in find_redis_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.
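
For context, as far as I understand it, `address="auto"` asks Ray to look for an instance that is already running on the machine where the script runs, so this error usually means that `ray start --head` was not run on that node first. Below is a minimal sketch, using only the Python standard library, of a check that a head node's port accepts TCP connections before `ray.init` is called. The helper name `head_is_reachable` is hypothetical, and an open port does not prove that the listener is actually Ray.

```python
import socket


def head_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `head_is_reachable("127.0.0.1", 6379)` would test the port that the scripts in this thread pass to `ray start`; the host shown here is only a placeholder.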

Issue Analytics

Created 3 years ago
Comments:33 (5 by maintainers)

Top GitHub Comments

2 reactions

Patol75 commented, Sep 1, 2020

I have managed to have Ray run on a PBS cluster using the following script:

#!/bin/bash
#PBS -l ncpus=192
#PBS -l mem=600GB
#PBS -l walltime=48:00:00
#PBS -l wd

module load python3/3.7.4

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)

echo parameters: $ip_head $redis_password

/path/to/ray start --head --port=6379 \
--redis-password=$redis_password \
--num-cpus 48 --num-gpus 0
sleep 10

for (( n=48; n<$PBS_NCPUS; n+=48 ))
do
  pbsdsh -n $n -v /path/to/startWorkerNode.sh \
  $ip_head $redis_password &
  sleep 10
done

cd /path/to/working/directory || exit
./Script.py --pw $redis_password

/path/to/ray stop

with startWorkerNode.sh being:

#!/bin/bash -l

module load python3/3.7.4

/path/to/ray start --block --address=$1 \
--redis-password=$2 --num-cpus 48 --num-gpus 0

/path/to/ray stop

Within Script.py, I have:

ray.init(address='auto', redis_password=args.pw)

where the Redis password is retrieved through argparse.
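
A minimal sketch of what that part of Script.py might look like, assuming the password arrives as `--pw` exactly as in the job script above. The function names `parse_args` and `main` are hypothetical. The `ray.init` call is copied from the line above; I believe Ray 1.0 and later renamed the parameter to `_redis_password`, so check the signature for your installed version.

```python
import argparse


def parse_args(argv=None):
    """Parse the --pw option that the job script passes to Script.py."""
    parser = argparse.ArgumentParser(description="Connect to a running Ray cluster")
    parser.add_argument("--pw", required=True, help="Redis password given to `ray start`")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported here so the argument parsing can be exercised without Ray installed.
    import ray

    # Same call as in the comment above. Assumption: the installed Ray version
    # accepts `redis_password`; newer versions may expect `_redis_password`.
    ray.init(address="auto", redis_password=args.pw)

# In Script.py, finish by calling main().
```

Keeping the parsing in its own function makes it easy to confirm the password is read correctly before the cluster connection is attempted.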

Hope that helps. 😃

1 reaction

Lewisracing commented, Dec 2, 2020

Great news: this is the final solution, and it works for Ray 1.0+. For the PBS cluster, we have one .sub script for job submission and one shell script to start the worker nodes. The scripts are as follows.

The job.sub script:

#!/bin/bash

#PBS -N pythoncpu_testray
#PBS -l select=2:ncpus=10:mpiprocs=10
#PBS -q five_day
#PBS -m abe
#PBS -M xxx@xxx.xx  
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n

ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID

cd $PBS_O_WORKDIR

jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
 
thishost=`uname -n | awk -F. '{print $1.}'`
thishostip=`hostname -i`
rayport=6379
 
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
 
echo "set up ray cluster..." 
for n in `echo ${jobnodes}`
do
        if [[ ${n} == "${thishost}" ]]
        then
                echo "first allocate node - use as headnode ..."
                module load PyTorch
                ray start --head
                sleep 5
        else
                ssh ${n}  $PBS_O_WORKDIR/startWorkerNode.sh ${thishostNport}
                sleep 10
        fi
done 
 
python <Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID
#

The startWorkerNode.sh script:

#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load PyTorch
ray start --address="${param1}" --redis-password='5241590000000000'

Note that on the PBS cluster I’m using, before submitting the .sub file, I need to go into the directory and run chmod on the .sh file to make it executable:

chmod +x startWorkerNode.sh

I hope this is a general solution for everyone. I finally made it work with huge help from my uni’s HPC specialist.
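
The node-list step in job.sub (the `uniq -c ... | awk ... | paste` pipeline that fills `jobnodes`) can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and unlike `uniq`, which collapses adjacent duplicates only, this version removes every repeated host name while keeping first-seen order.

```python
def unique_short_hostnames(nodefile_lines):
    """Return short host names (text before the first dot), deduplicated, in first-seen order."""
    seen = []
    for line in nodefile_lines:
        name = line.strip().split(".")[0]
        if name and name not in seen:
            seen.append(name)
    return seen


def read_nodes(path):
    """Read a PBS node file, i.e. the path stored in $PBS_NODEFILE."""
    with open(path) as fh:
        return unique_short_hostnames(fh)
```

Inside a job, `read_nodes(os.environ["PBS_NODEFILE"])` (after `import os`) would give the list of allocated nodes, the first of which the script above uses as the head node.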
