Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[ray] worker_start_ray_commands are not executed for private cluster

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.7.0
  • Python version: 3.6.7
  • Exact command to reproduce:

Describe the problem

I am following the private cluster setup instructions, but only the head node starts. A few interesting points:

Source code / logs

cluster_name: tesq_cluster
min_workers: 48
max_workers: 48
initial_workers: 48
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
auth:
    ssh_user: tesq
    ssh_private_key: /home/me/.ssh/keys/local_user
file_mounts: {}
setup_commands: []
initialization_commands: []
head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
    - source activate py3_prod && ray stop
    - echo 'I am here' >> /home/tesq/new_file.txt
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379
worker_start_ray_commands:
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray stop
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray start --redis-address=ip1:6379
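One thing worth ruling out when workers never join is basic connectivity: the last command above only works if each worker can reach the head node's Redis port. A minimal reachability probe, run from a worker, might look like the sketch below. It uses bash's /dev/tcp pseudo-device (bash-specific, not POSIX sh); `ip1` and `6379` are the placeholder head address and port from the config above, so substitute the real values.

```shell
#!/bin/bash
# Probe whether the head node's Redis port is reachable from this machine.
# HEAD_IP is the "ip1" placeholder from the config above; replace it with
# the actual head-node address before running.
HEAD_IP=ip1
PORT=6379
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${HEAD_IP}/${PORT}" 2>/dev/null; then
  echo "head Redis port reachable"
else
  echo "head Redis port unreachable"
fi
```

If this prints "unreachable" from a worker, no `worker_start_ray_commands` would succeed anyway, regardless of whether the autoscaler runs them.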

After that, only the head node starts, and only on the head node do I see the created file new_file.txt. Example output of the command ray.global_state.client_table():

{'ClientID': 'a7ce937ffcbece9b25a779fa126ba47edef27267',
  'IsInsertion': True,
  'NodeManagerAddress': 'ip1',
  'NodeManagerPort': 45759,
  'ObjectManagerPort': 34107,
  'ObjectStoreSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/raylet',
  'Resources': {'GPU': 3.0, 'CPU': 24.0}},

Update: this seems very similar to issue https://github.com/ray-project/ray/issues/3190, but the files monitor.err and monitor.out are empty.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

2 reactions
solacerace commented, Jun 24, 2020

@ijrsvt I’m behind the company’s firewall, so sorry, I will not be able to post the complete YAML.

I got the YAML from here and updated the head_ip worker_ips and ssh_user.

When I run the command ray up config.yaml, it brings up Ray on head_ip as the head node and also:

  • prints the command to add additional nodes to the cluster
  • prints the UI address
  • but does not bring up Ray on the worker nodes

Whereas upon manually running the command ray start --address=head_ip:port on each of the worker machines, the worker nodes get added to the cluster.

So maybe if you could share a working YAML, one which can bring up Ray on a head node and worker nodes, I could use that as a reference. Appreciate your help. – thanks
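The manual workaround described above (running `ray start` on each worker by hand) can be scripted. A minimal sketch, assuming passwordless SSH as the `tesq` user and the placeholder addresses from the config earlier in the thread (`ip1`, `ip2`–`ip4`, port 6379 are all assumptions to substitute):

```shell
#!/bin/sh
# Dry-run sketch of the manual workaround: attach each worker to the head
# node over SSH. The commands are echoed rather than executed so the loop
# can be inspected first; pipe the output to sh (or drop the echo) to run.
HEAD_ADDR="ip1:6379"
for WORKER in ip2 ip3 ip4; do
  echo "ssh tesq@${WORKER} 'ray stop; ray start --address=${HEAD_ADDR}'"
done
```

This only papers over the problem, of course; the autoscaler is supposed to run exactly these worker start commands itself.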

2 reactions
gimzmoe commented, Apr 20, 2020

ray v0.8.4, Python 3.6.9, Ubuntu 18.04.4. I’m running into this same thing: none of the commands (setup_commands or worker_start_ray_commands) appear to be executing.

I guess it might not be obvious, but that includes the bit about starting up the worker clients. Basically only the head node is launched; none of the workers appear to be executing any commands, “ray start” or otherwise.

Read more comments on GitHub >

Top Results From Across the Web

Ray k8s cluster, cannot run new task when previous task failed
Hi, I launched a Ray k8s cluster and debug my code with the cluster. However, when my task failed and I restarted the...
Read more >
Scaling Applications on Kubernetes with Ray | by Vishnu Deva
The process of connecting these new pods to the Ray Cluster happens using the headStartRayCommands and workerStartRayCommands. These commands ...
Read more >
Autoscaling of the Ray node type worker nodes - Stack Overflow
I have created ray cluster with 1 headtype node and worker type ... is fully consumed, no worker node is created by kubernetes...
Read more >
Manage private clusters in Cloud Code for VS Code
To manage/delete the instances that you created, see VM instances. To successfully connect to the private cluster, Cloud Code must be running on...
Read more >
Ensure private cluster is enabled when creating Kubernetes ...
This is achieved as the nodes have internal RFC 1918 IP addresses only. In private clusters, the cluster master has private and public...
Read more >
