[ray] worker_start_ray_commands are not executed for private cluster
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
- Ray installed from (source or binary): pip
- Ray version: 0.7.0
- Python version: 3.6.7
- Exact command to reproduce:
Describe the problem
I am following the private cluster setup instructions, but only the head node starts. A few points of interest:
- Seems similar to issue https://github.com/ray-project/ray/issues/3408
- Adding initialization_commands: [] fixes the KeyError mentioned in https://github.com/ray-project/ray/issues/4559
Source code / logs
cluster_name: tesq_cluster
min_workers: 48
max_workers: 48
initial_workers: 48
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
auth:
    ssh_user: tesq
    ssh_private_key: /home/me/.ssh/keys/local_user
file_mounts: {}
setup_commands: []
initialization_commands: []
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - source activate py3_prod && ray stop
    - echo 'I am here' >> /home/tesq/new_file.txt
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379
worker_start_ray_commands:
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray stop
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray start --redis-address=ip1:6379
After that, only the head node starts, and the file new_file.txt is created only on the head node.
Example output of command ray.global_state.client_table()
{'ClientID': 'a7ce937ffcbece9b25a779fa126ba47edef27267',
'IsInsertion': True,
'NodeManagerAddress': 'ip1',
'NodeManagerPort': 45759,
'ObjectManagerPort': 34107,
'ObjectStoreSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/plasma_store',
'RayletSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/raylet',
'Resources': {'GPU': 3.0, 'CPU': 24.0}},
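To make the symptom easier to confirm, here is a minimal sketch that compares a client table like the one above against the IPs from the config and lists the nodes that never registered. The helper name `missing_nodes` is hypothetical, and treating `IsInsertion: True` as "node is currently registered" is an assumption. The sample entry is trimmed from the output above; in a live session one would pass the result of ray.global_state.client_table() instead.

```python
from typing import Dict, Iterable, List


def missing_nodes(client_table: Iterable[Dict], expected_ips: Iterable[str]) -> List[str]:
    """Return the expected IPs that have no registered entry in the client table.

    Hypothetical helper for illustration; not part of Ray.
    """
    # Assumption: an entry with IsInsertion == True describes a node that is
    # currently part of the cluster.
    live = {
        entry["NodeManagerAddress"]
        for entry in client_table
        if entry.get("IsInsertion", True)
    }
    return [ip for ip in expected_ips if ip not in live]


# Entry trimmed from the output shown in this report.
sample_table = [
    {
        "ClientID": "a7ce937ffcbece9b25a779fa126ba47edef27267",
        "IsInsertion": True,
        "NodeManagerAddress": "ip1",
        "Resources": {"GPU": 3.0, "CPU": 24.0},
    },
]

# head_ip and worker_ips as given in the cluster config.
print(missing_nodes(sample_table, ["ip1", "ip2", "ip3", "ip4"]))
# → ['ip2', 'ip3', 'ip4']
```

With the table shown here, all three worker IPs come back as missing, which matches the observation that only the head node joined.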
Update:
Seems very similar to issue https://github.com/ray-project/ray/issues/3190, but the files monitor.err and monitor.out are empty.
Top GitHub Comments
@ijrsvt I'm behind the company's firewall, so sorry, I will not be able to post the complete YAML.
I got the YAML from here and updated head_ip, worker_ips, and ssh_user. When I run the command ray up config.yaml, it brings up Ray on the head_ip as the head node and also
Whereas when I manually run the command ray start --address=head_ip:port on each of the worker machines, the worker nodes get added to the cluster.
So maybe if you could share a working YAML that can bring up Ray on a head node and worker nodes, I could use that as a reference. Appreciate your help. Thanks
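Until the autoscaler runs the worker commands itself, the manual step described above could be scripted. The following is only a sketch of that workaround, not a fix for the underlying problem: the function name `worker_start_commands` is hypothetical, the host names, user, and key path are the placeholder values from the config in the original report, and it assumes plain `ssh -i` access to each worker. Note that the original report (Ray 0.7.0) used --redis-address rather than --address.

```python
import shlex
from typing import List, Sequence


def worker_start_commands(
    worker_ips: Sequence[str],
    ssh_user: str,
    ssh_key: str,
    head_address: str,
    env_setup: str = "",
) -> List[List[str]]:
    """Build one ssh command per worker that runs `ray start` against the head.

    Hypothetical helper for illustration; it only builds the commands and
    does not execute anything.
    """
    remote = f"ray start --address={head_address}"
    if env_setup:
        # Optional environment activation, e.g. "source activate py3_prod".
        remote = f"{env_setup} && {remote}"
    return [["ssh", "-i", ssh_key, f"{ssh_user}@{ip}", remote] for ip in worker_ips]


# Placeholder values taken from the config in the original report.
commands = worker_start_commands(
    worker_ips=["ip2", "ip3", "ip4"],
    ssh_user="tesq",
    ssh_key="/home/me/.ssh/keys/local_user",
    head_address="ip1:6379",
    env_setup="source activate py3_prod",
)

for cmd in commands:
    # Print a shell-safe version of each command for review.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed line can be checked by hand before running it; to execute from Python, each list could be passed to subprocess.run(cmd, check=True).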
Ray v0.8.4, Python 3.6.9, Ubuntu 18.04.4. I'm running into this same thing: none of the commands (setup_commands or worker_start_ray_commands) appear to be executing.
I guess it might not be obvious, but that includes the part that starts up the worker clients. Basically, only the head node is launched; none of the workers appear to be executing any commands, “ray start” or otherwise.