Ray on a cluster: ConnectionError: Could not find any running Ray instance

See original GitHub issue

I’m trying to test ray on a university cluster with the code below

import ray
ray.init(address="auto")
import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))

But it raises the error below. Am I using Ray incorrectly?

  File "<stdin>", line 2, in <module>
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/worker.py", line 643, in init
    address, redis_address)
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 273, in validate_redis_address
    address = find_redis_address_or_die()
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 165, in find_redis_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.
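The traceback indicates that ray.init(address="auto") only attaches to a Ray instance that is already running on the current machine; it does not start one. Below is a minimal sketch of a wrapper that turns this failure into a readable hint. The helper name connect_or_explain is hypothetical, and the init callable is passed in as a parameter so the sketch does not depend on any particular Ray version.

```python
from typing import Callable, Optional


def connect_or_explain(init: Callable[..., object], address: str = "auto") -> Optional[object]:
    """Call init(address=...) and print a hint if no Ray instance is found.

    `init` is expected to behave like ray.init, which raises the built-in
    ConnectionError shown in the traceback above.
    """
    try:
        return init(address=address)
    except ConnectionError as exc:
        # Hypothetical message wording; adjust for your own cluster setup.
        print(
            "No running Ray instance was found for address=%r. "
            "Start one first (for example with `ray start --head`) "
            "or pass the head node's address explicitly." % address
        )
        print("Original error:", exc)
        return None


# Usage with Ray (assumes Ray is installed):
#   import ray
#   connect_or_explain(ray.init, address="auto")
```

On a batch cluster this usually means the head node has to be started inside the job itself, which is what the scripts in the comments below do.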

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 33 (5 by maintainers)

Top GitHub Comments

2 reactions
Patol75 commented, Sep 1, 2020

I managed to get Ray running on a PBS cluster using the following script:

#!/bin/bash
#PBS -l ncpus=192
#PBS -l mem=600GB
#PBS -l walltime=48:00:00
#PBS -l wd

module load python3/3.7.4

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)

echo parameters: $ip_head $redis_password

/path/to/ray start --head --port=6379 \
--redis-password=$redis_password \
--num-cpus 48 --num-gpus 0
sleep 10

for (( n=48; n<$PBS_NCPUS; n+=48 ))
do
  pbsdsh -n $n -v /path/to/startWorkerNode.sh \
  $ip_head $redis_password &
  sleep 10
done

cd /path/to/working/directory || exit
./Script.py --pw $redis_password

/path/to/ray stop

where startWorkerNode.sh is:

#!/bin/bash -l

module load python3/3.7.4

/path/to/ray start --block --address=$1 \
--redis-password=$2 --num-cpus 48 --num-gpus 0

/path/to/ray stop

Within Script.py, I have:

ray.init(address='auto', redis_password=args.pw)

where the Redis password is retrieved through argparse.
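As a rough illustration of that last step, here is a minimal sketch of how Script.py might read the --pw option and pass it to ray.init. Only the --pw flag and the ray.init call come from the comment above; the function names parse_args and main are hypothetical. The redis_password keyword matches the Ray version used in this comment, and other Ray releases may name it differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --pw carries the Redis password from the job script."""
    parser = argparse.ArgumentParser(description="Connect to a running Ray cluster.")
    parser.add_argument("--pw", required=True,
                        help="Redis password generated by the PBS job script")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported here so the parser can be exercised without Ray installed.
    import ray
    # Same call as in the comment above: attach to the head node started by the job.
    ray.init(address="auto", redis_password=args.pw)


# The job script would invoke this as: ./Script.py --pw $redis_password
# and the file would end by calling main().
```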

Hope that helps. 😃
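For readers unfamiliar with the worker launch loop in the PBS script above, the sketch below reproduces its arithmetic. It assumes each node contributes 48 CPUs, as the --num-cpus 48 flags suggest; the function name worker_offsets is hypothetical.

```python
def worker_offsets(total_cpus, cpus_per_node):
    """Return the values of n visited by: for (( n=cpus_per_node; n<total_cpus; n+=cpus_per_node ))."""
    return list(range(cpus_per_node, total_cpus, cpus_per_node))


# With ncpus=192 and 48 CPUs per node, the head node takes the first block
# and one worker is launched at each remaining offset.
print(worker_offsets(192, 48))  # prints [48, 96, 144]
```

So a 192-CPU request yields one head node and three workers under that assumption.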

1 reaction
Lewisracing commented, Dec 2, 2020

Great news: here is a final solution that works for Ray 1.0+. For the PBS cluster, we have one .sub script for job submission and one shell script to start the worker nodes. The scripts are as follows. The job.sub script:

#!/bin/bash

#PBS -N pythoncpu_testray
#PBS -l select=2:ncpus=10:mpiprocs=10
#PBS -q five_day
#PBS -m abe
#PBS -M xxx@xxx.xx  
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n

ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID

cd $PBS_O_WORKDIR

jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
 
thishost=`uname -n | awk -F. '{print $1}'`
thishostip=`hostname -i`
rayport=6379
 
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
 
echo "set up ray cluster..." 
for n in `echo ${jobnodes}`
do
        if [[ ${n} == "${thishost}" ]]
        then
                echo "first allocate node - use as headnode ..."
                module load PyTorch
                ray start --head
                sleep 5
        else
                ssh ${n}  $PBS_O_WORKDIR/startWorkerNode.sh ${thishostNport}
                sleep 10
        fi
done 
 
python <Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID
#
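
The jobnodes pipeline in the script above reduces the PBS node file to a list of short hostnames. A minimal Python sketch of the same idea follows; the function name short_hostnames is hypothetical. Unlike uniq, which only collapses adjacent duplicates, this version removes duplicates anywhere in the file while keeping first-seen order.

```python
def short_hostnames(nodefile_text):
    """Return unique hostnames, stripped of their domain part, in first-seen order."""
    seen = []
    for line in nodefile_text.splitlines():
        name = line.strip().split(".")[0]  # keep the part before the first dot
        if name and name not in seen:
            seen.append(name)
    return seen


# Example with made-up hostnames; on a real job the text would come from
# the file named by the PBS_NODEFILE environment variable.
sample = "node01.cluster.example\nnode01.cluster.example\nnode02.cluster.example\n"
print(short_hostnames(sample))  # prints ['node01', 'node02']
```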

The startWorkerNode.sh script:

#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load PyTorch
ray start --address="${param1}" --redis-password='5241590000000000'

Note that on the PBS cluster I'm using, I need to make the .sh file executable before submitting the .sub file. From the directory that contains it, run:

chmod +x startWorkerNode.sh

I hope this is a general solution for everyone. I finally made it work with huge help from my uni’s HPC specialist.
