
No logging of SSH failures of Ray nodes?

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 6 HPC environment
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.3
  • Python version: 3.7
  • Exact command to reproduce:

I’ve designed a custom NodeProvider to add autoscaling support for Univa Grid Engine HPC environments. I’m having the hardest time, though, debugging errors around permanent “waiting-for-ssh” states. At one point I got a Ray head node properly qsub’d and running via ray up blarg.yml -y (which then goes on to spawn two initial workers as configured), but the workers are not able to complete that dummy SSH step back to the head for whatever reason. Unfortunately, I’ve been through all the logs that Ray and the job script produce, and have had no luck finding specifics on why the workers’ SSH attempts failed. Is there a particular way to get the reason(s) behind the SSH failure exported to the log files?
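While Ray itself isn’t logging the failure reason, one way to surface it is to replicate the updater’s SSH probe by hand from the head node, with maximum verbosity. This is only a sketch: the user and key path are taken from the autoscaler config shown in monitor.err below, and `<worker-ip>` is a placeholder for whatever IP your NodeProvider reports.

```shell
# Replicate the autoscaler's SSH probe manually; -vvv prints the full
# handshake, so the actual failure (routing, key rejection, host key
# prompt, permissions) shows up on stderr.
ssh -vvv \
    -i /home/myusername/.ssh/id_dsa \
    -o ConnectTimeout=5 \
    -o StrictHostKeyChecking=no \
    myusername@<worker-ip> uptime
```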

Now that I’ve been messing with this a bit (e.g., creating new SSH keys), I’m having trouble even getting the qsub’d ray start to connect to the ray up process (the same “waiting-for-ssh” problem). I’m sure it’s some configuration issue on my end, but because I get so little feedback on the failure mode, this has been painful to debug, to say the least, haha.
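Before digging further into keys, it can help to rule out plain network reachability from the head node to the worker. A minimal sketch, assuming nothing Ray-specific (host and port are whatever your NodeProvider reports):

```python
import socket

def check_ssh_port(host, port=22, timeout=5):
    """Return None if a TCP connection to host:port succeeds,
    otherwise a short description of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return f"{host}:{port} unreachable: {exc}"
```

If this succeeds but the autoscaler still times out, the problem is likelier authentication (keys, authorized_keys, home-directory permissions) than routing.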

Command to start the head node that my NodeProvider qsubs from ray up:

ray start --block --num-cpus 1 --temp-dir /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head --head --redis-port 31236 --autoscaling-config /home/myusername/ray_cluster//autoscaling_config.yml

Note that I can see that both the head-node job and the two worker jobs have been submitted and are running on the cluster.

Log outputs:

monitor.out

This is the output from the head node’s NodeProvider (my custom debugging output). It’s too long to paste in this issue, so it can be found here:

https://gist.github.com/zbarry/0282655b01bc51f247a7a8fa6477e9e9

Cluster / job log

[INFO] [1550510728] [Mon Feb 18 12:25:28 EST 2019] [16366] [18509843] Starting the workflow
/home/myusername/miniconda3/envs/ray/bin/python
/home/myusername/miniconda3/envs/ray/bin/ray
2019-02-18 12:25:32,108	INFO scripts.py:288 -- Using IP address <redacted> for this node.
2019-02-18 12:25:32,113	INFO node.py:278 -- Process STDOUT and STDERR is being redirected to /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head/logs.
2019-02-18 12:25:32,231	INFO services.py:396 -- Waiting for redis server at 127.0.0.1:31236 to respond...
2019-02-18 12:25:32,348	INFO services.py:396 -- Waiting for redis server at 127.0.0.1:48029 to respond...
2019-02-18 12:25:32,349	INFO services.py:798 -- Starting Redis shard with 10.0 GB max memory.
2019-02-18 12:25:32,384	WARNING services.py:1206 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-02-18 12:25:32,385	INFO services.py:1360 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-02-18 12:25:32,433	INFO scripts.py:319 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address redacted:31236

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="redacted:31236")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

debug_state.txt

NodeManager:
InitialConfigResources: {GPU,0.000000},{CPU,1.000000}
ClusterResources:
9314824bc109cd85b897aacc8d6b917b2bcc80d2: 
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
499b4c9a4193e9825cc616cf2051285f576b6dc0: 
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
8e0ad117daf498c80775ec6bf916f9ab098b0969: 
- total: {GPU,0.000000},{CPU,1.000000}
- avail: {GPU,0.000000},{CPU,1.000000}
ObjectManager:
- num local objects: 0
- num active wait requests: 0
- num unfulfilled push requests: 0
- num pull requests: 0
- num buffered profile events: 0
ObjectDirectory:
- num listeners: 0
- num eviction entries: 0
ObjectStoreNotificationManager:
- num adds processed: 0
- num removes processed: 0
BufferPool:
- get buffer state map size: 0
- create buffer state map size: 0
ConnectionPool:
- num message send connections: 0
- num transfer send connections: 0
- num avail message send connections: 0
- num avail transfer send connections: 0
- num message receive connections: 0
- num transfer receive connections: 0
AsyncGcsClient:
- TaskTable: num lookups: 0, num adds: 0
- ActorTable: num lookups: 0, num appends: 0
- TaskReconstructionLog: num lookups: 0, num appends: 0
- TaskLeaseTable: num lookups: 0, num adds: 0
- HeartbeatTable: num lookups: 0, num adds: 5901
- ErrorTable: num lookups: 0, num appends: 0
- ProfileTable: num lookups: 0, num appends: 0
- ClientTable: num lookups: 0, num appends: 1, cache size: 3, num removed: 0
- DriverTable: num lookups: 0, num appends: 0
WorkerPool:
- num workers: 1
- num drivers: 0
SchedulingQueue:
- num placeable tasks: 0
- num waiting tasks: 0
- num ready tasks: 0
- num running tasks: 0
- num infeasible tasks: 0
- num waiting for actor creation tasks: 0
- num tasks blocked: 0
ReconstructionPolicy:
- num reconstructing: 0
TaskDependencyManager:
- task dep map size: 0
- task req map size: 0
- req objects map size: 0
- local objects map size: 0
- pending tasks map size: 0
LineageCache:
- committed tasks: 0
- child map size: 0
- num subscribed tasks: 0
- lineage size: 0
ActorRegistry:
- num live actors: 0
- num reconstructing actors: 0
- num dead actors: 0
- max num handles: 0
RemoteConnections:
499b4c9a4193e9825cc616cf2051285f576b6dc0: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
9314824bc109cd85b897aacc8d6b917b2bcc80d2: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
DebugString() time ms: 1

monitor.err

2019-02-18 12:25:33,473	INFO autoscaler.py:351 -- StandardAutoscaler: {'cluster_name': 'pipeline', 'min_workers': 2, 'max_workers': 10, 'initial_workers': 2, 'docker': {'image': '', 'container_name': ''}, 'target_utilization_fraction': 0.8, 'idle_timeout_minutes': 10, 'provider': {'type': 'external', 'module': 'pipeline.hpc.UGENodeProvider'}, 'auth': {'ssh_user': 'myusername', 'ssh_private_key': '/home/myusername/.ssh/id_dsa'}, 'head_node': {}, 'worker_nodes': {}, 'file_mounts': {}, 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': []}
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:25:34,188	INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:34,189	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:34,189	INFO autoscaler.py:588 -- StandardAutoscaler: Launching 2 new nodes
2019-02-18 12:25:34,798	INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (2 pending) (bringup=True)
2019-02-18 12:25:34,798	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:39,122	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:39,122	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=4 Mean=4 Max=4, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:43,406	INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_hbfty_8z: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,613	INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_lclafqgd: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614	INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614	INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:45,373	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:45,373	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=10 Mean=10 Max=10, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:46,386	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:46,481	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:50,393	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:50,393	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=15 Mean=15 Max=15, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:55,470	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:55,470	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=20 Mean=20 Max=20, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:57,619	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:57,657	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:58,199	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got IP [LogTimer=12396ms]
2019-02-18 12:25:58,210	INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Waiting for SSH...
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:25:58,222	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got IP [LogTimer=12404ms]
2019-02-18 12:25:58,251	INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Waiting for SSH...
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
Monitor: could not find ip for client 9314824bc109cd85b897aacc8d6b917b2bcc80d2
Monitor: could not find ip for client 499b4c9a4193e9825cc616cf2051285f576b6dc0

~~ snip: the two “could not find ip” messages above repeat many more times ~~
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:00,507	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:00,507	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=1 Mean=9 Max=25, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:05,560	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:05,560	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=6 Mean=14 Max=30, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:10,580	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:10,580	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=11 Mean=19 Max=35, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:15,686	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:15,687	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=16 Mean=24 Max=40, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:26:20,706	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:20,706	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=21 Mean=29 Max=45, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

~~ snip ~~

2019-02-18 12:30:51,015	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got SSH [LogTimer=292805ms]
2019-02-18 12:30:51,015	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306401ms]
2019-02-18 12:30:51,015	ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Error updating Unable to SSH to node
2019-02-18 12:30:51,086	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got SSH [LogTimer=292835ms]
2019-02-18 12:30:51,086	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306472ms]
2019-02-18 12:30:51,086	ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Error updating Unable to SSH to node
Exception in thread Thread-48:
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
    raise e
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
    self.do_update()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
    raise e
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
    self.do_update()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

2019-02-18 12:30:54,114	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:54,114	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=295 Mean=303 Max=319, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=3 Max=5
2019-02-18 12:30:56,944	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:56,944	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=297 Mean=305 Max=321, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:03,112	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:03,112	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=304 Mean=312 Max=328, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:09,433	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:09,434	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=310 Mean=318 Max=334, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

~~ snip ~~

redis-shard_0.out

16540:C 18 Feb 2019 12:25:32.248 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16540:C 18 Feb 2019 12:25:32.248 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16540, just started
16540:C 18 Feb 2019 12:25:32.248 # Configuration loaded
16540:M 18 Feb 2019 12:25:32.249 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16540:M 18 Feb 2019 12:25:32.249 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16540:M 18 Feb 2019 12:25:32.249 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16540:M 18 Feb 2019 12:25:32.249 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16540:M 18 Feb 2019 12:25:32.249 # Server initialized
16540:M 18 Feb 2019 12:25:32.249 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

redis.out

16536:C 18 Feb 2019 12:25:32.135 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16536:C 18 Feb 2019 12:25:32.135 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16536, just started
16536:C 18 Feb 2019 12:25:32.135 # Configuration loaded
16536:M 18 Feb 2019 12:25:32.136 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16536:M 18 Feb 2019 12:25:32.136 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16536:M 18 Feb 2019 12:25:32.136 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16536:M 18 Feb 2019 12:25:32.136 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16536:M 18 Feb 2019 12:25:32.136 # Server initialized
16536:M 18 Feb 2019 12:25:32.136 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
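The ulimit warnings in both Redis logs are a separate, non-fatal issue, but they are easy to check from Python. This sketch reports the file-descriptor limit and raises the soft limit toward Redis’s requested 10032 where possible; going past the hard limit still requires 'ulimit -n' (or an admin), exactly as the log says.

```python
import resource

def report_fd_limit(wanted=10032):
    """Check whether the soft RLIMIT_NOFILE covers `wanted` descriptors,
    raising it toward the hard limit if possible."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft >= wanted:
        return f"ok: soft limit {soft} covers {wanted}"
    # Raising soft up to hard needs no privileges; beyond hard needs root.
    ceiling = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if ceiling > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (ceiling, hard))
        return f"raised soft limit {soft} -> {ceiling} (hard limit: {hard})"
    return f"cannot raise: soft={soft}, hard={hard}; 'ulimit -n' must be increased"
```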

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

3 reactions
robertnishihara commented, Feb 28, 2019

Not sure if this is the issue, but there are multiple ports that need to be open between the different machines.

For example, by default there are two Redis servers (one “primary” and one or more “non-primary” shards). The “object manager” and “node manager” on each machine also need their own port.

These ports can be specified in the ray start command with flags like --redis-port=6379, --redis-shard-ports=6380, --node-manager-port=12345, and --object-manager-port=12346. Note that the Redis ones are only relevant on the head node.
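Putting those flags together, the invocations might look like the following. The port numbers are arbitrary examples, not defaults you must use; pick ports your cluster’s firewall actually allows, and note (per the comment above) that the Redis flags only apply on the head node.

```shell
# Head node: pin the Redis ports plus this node's manager ports.
ray start --head \
    --redis-port=6379 \
    --redis-shard-ports=6380 \
    --node-manager-port=12345 \
    --object-manager-port=12346

# Worker nodes: only the manager ports apply; Redis runs on the head.
ray start --redis-address=<head-ip>:6379 \
    --node-manager-port=12345 \
    --object-manager-port=12346
```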

For ray.init(), only the “primary” Redis address is relevant. That’s the first one that is printed.

When you get the error message “Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?”, can you check the logs on the same machine in /tmp/ray/session*/ (just the most recent session) and see if anything looks like it crashed?
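To make that check less tedious, a quick scan of the most recent session’s logs for crash indicators might look like this. It is a sketch: the /tmp/ray/session*/ layout is as described above, and the needle list is just a guess at useful markers.

```python
import glob
import os

def scan_ray_logs(pattern="/tmp/ray/session*/logs/*",
                  needles=("ERROR", "Traceback", "FATAL")):
    """Return (path, line number, line) for log lines that look like crashes."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        try:
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if any(n in line for n in needles):
                        hits.append((path, lineno, line.rstrip()))
        except OSError:
            pass  # log file disappeared or is unreadable; skip it
    return hits
```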

0 reactions
zbarry commented, Mar 22, 2019

This was caused by trying to run a driver from a computer that was not running a Ray worker. It can be solved by simply starting a “dummy” worker on that computer: ray start --num-cpus 0 --redis-address yourheadnode:port

Read more comments on GitHub >

