No logging of SSH failures of Ray nodes?
See original GitHub issue.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 6 HPC environment
- Ray installed from (source or binary): pip
- Ray version: 0.6.3
- Python version: 3.7
- Exact command to reproduce:
I’ve designed a custom `NodeProvider` to provide autoscaling support on Univa Grid Engine HPC environments. I’m having the hardest time, though, debugging permanent “waiting-for-ssh” states. At one point I did get a Ray head node properly `qsub`’d and running via `ray up blarg.yml -y` (which then goes on to spawn the two initial workers as configured), but the workers are not able to complete that dummy SSH step back to the head for whatever reason. Unfortunately, I’ve run through all the logs that Ray and the job script spit out, and have had no luck finding any specifics on why the workers’ SSH attempts failed. Is there a particular way to get the reason(s) behind an SSH failure exported to the log files?
Now that I’ve been messing with this a bit, e.g., creating new SSH keys, I’m having trouble even getting the `qsub`’d `ray start` process to connect back to the `ray up` process (the same “waiting-for-ssh” problem). I’m sure it’s some configuration issue on my end, but because I get so little feedback on the failure mode, this has been painful to debug, to say the least.
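In the meantime, one way to surface the actual failure reason is to reproduce the SSH probe by hand and capture its stderr. This is a rough sketch only: the exact command and options Ray’s `NodeUpdater` uses may differ, and the user, key path, and host names below are placeholders.

```python
# Rough sketch: manually reproduce an autoscaler-style SSH probe and capture
# stderr, since the updater only reports "Unable to SSH to node".
# The user, key path, and hostnames are placeholders; the exact options
# Ray's NodeUpdater passes may differ.
import subprocess


def build_probe_cmd(host, user, key):
    """Build an SSH command roughly like the autoscaler's dummy SSH step."""
    return [
        "ssh", "-vvv",                     # verbose: prints the auth negotiation
        "-i", key,
        "-o", "StrictHostKeyChecking=no",
        "-o", "ConnectTimeout=5",
        f"{user}@{host}",
        "uptime",                          # any trivial remote command
    ]


def probe(host, user="myusername", key="/home/myusername/.ssh/id_dsa"):
    result = subprocess.run(build_probe_cmd(host, user, key),
                            capture_output=True, text=True)
    print("exit code:", result.returncode)
    print(result.stderr)                   # the -vvv trace goes to stderr
    return result.returncode == 0


print(" ".join(build_probe_cmd("workernode", "myusername",
                               "/home/myusername/.ssh/id_dsa")))
```

Running `probe()` from the head node against a stuck worker should show, in the `-vvv` trace, whether the failure is key rejection, host-key problems, or a network/timeout issue.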
Command that my `NodeProvider` `qsub`s from `ray up` to start the head node:
ray start --block --num-cpus 1 --temp-dir /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head --head --redis-port 31236 --autoscaling-config /home/myusername/ray_cluster//autoscaling_config.yml
Note that I can see that the head node job and the two worker jobs have been submitted and are running on the cluster.
Log outputs:
monitor.out
This is the output from the head node’s `NodeProvider` (my custom debugging output). It’s too long to paste in this issue, so find it here:
https://gist.github.com/zbarry/0282655b01bc51f247a7a8fa6477e9e9
Cluster / job log
[INFO] [1550510728] [Mon Feb 18 12:25:28 EST 2019] [16366] [18509843] Starting the workflow
/home/myusername/miniconda3/envs/ray/bin/python
/home/myusername/miniconda3/envs/ray/bin/ray
2019-02-18 12:25:32,108 INFO scripts.py:288 -- Using IP address <redacted> for this node.
2019-02-18 12:25:32,113 INFO node.py:278 -- Process STDOUT and STDERR is being redirected to /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head/logs.
2019-02-18 12:25:32,231 INFO services.py:396 -- Waiting for redis server at 127.0.0.1:31236 to respond...
2019-02-18 12:25:32,348 INFO services.py:396 -- Waiting for redis server at 127.0.0.1:48029 to respond...
2019-02-18 12:25:32,349 INFO services.py:798 -- Starting Redis shard with 10.0 GB max memory.
2019-02-18 12:25:32,384 WARNING services.py:1206 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-02-18 12:25:32,385 INFO services.py:1360 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-02-18 12:25:32,433 INFO scripts.py:319 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --redis-address redacted:31236
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(redis_address="redacted:31236")
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
debug_state.txt
NodeManager:
InitialConfigResources: {GPU,0.000000},{CPU,1.000000}
ClusterResources:
9314824bc109cd85b897aacc8d6b917b2bcc80d2:
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
499b4c9a4193e9825cc616cf2051285f576b6dc0:
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
8e0ad117daf498c80775ec6bf916f9ab098b0969:
- total: {GPU,0.000000},{CPU,1.000000}
- avail: {GPU,0.000000},{CPU,1.000000}
ObjectManager:
- num local objects: 0
- num active wait requests: 0
- num unfulfilled push requests: 0
- num pull requests: 0
- num buffered profile events: 0
ObjectDirectory:
- num listeners: 0
- num eviction entries: 0
ObjectStoreNotificationManager:
- num adds processed: 0
- num removes processed: 0
BufferPool:
- get buffer state map size: 0
- create buffer state map size: 0
ConnectionPool:
- num message send connections: 0
- num transfer send connections: 0
- num avail message send connections: 0
- num avail transfer send connections: 0
- num message receive connections: 0
- num transfer receive connections: 0
AsyncGcsClient:
- TaskTable: num lookups: 0, num adds: 0
- ActorTable: num lookups: 0, num appends: 0
- TaskReconstructionLog: num lookups: 0, num appends: 0
- TaskLeaseTable: num lookups: 0, num adds: 0
- HeartbeatTable: num lookups: 0, num adds: 5901
- ErrorTable: num lookups: 0, num appends: 0
- ProfileTable: num lookups: 0, num appends: 0
- ClientTable: num lookups: 0, num appends: 1, cache size: 3, num removed: 0
- DriverTable: num lookups: 0, num appends: 0
WorkerPool:
- num workers: 1
- num drivers: 0
SchedulingQueue:
- num placeable tasks: 0
- num waiting tasks: 0
- num ready tasks: 0
- num running tasks: 0
- num infeasible tasks: 0
- num waiting for actor creation tasks: 0
- num tasks blocked: 0
ReconstructionPolicy:
- num reconstructing: 0
TaskDependencyManager:
- task dep map size: 0
- task req map size: 0
- req objects map size: 0
- local objects map size: 0
- pending tasks map size: 0
LineageCache:
- committed tasks: 0
- child map size: 0
- num subscribed tasks: 0
- lineage size: 0
ActorRegistry:
- num live actors: 0
- num reconstructing actors: 0
- num dead actors: 0
- max num handles: 0
RemoteConnections:
499b4c9a4193e9825cc616cf2051285f576b6dc0:
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
9314824bc109cd85b897aacc8d6b917b2bcc80d2:
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
DebugString() time ms: 1
monitor.err
2019-02-18 12:25:33,473 INFO autoscaler.py:351 -- StandardAutoscaler: {'cluster_name': 'pipeline', 'min_workers': 2, 'max_workers': 10, 'initial_workers': 2, 'docker': {'image': '', 'container_name': ''}, 'target_utilization_fraction': 0.8, 'idle_timeout_minutes': 10, 'provider': {'type': 'external', 'module': 'pipeline.hpc.UGENodeProvider'}, 'auth': {'ssh_user': 'myusername', 'ssh_private_key': '/home/myusername/.ssh/id_dsa'}, 'head_node': {}, 'worker_nodes': {}, 'file_mounts': {}, 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': []}
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:25:34,188 INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:34,189 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:34,189 INFO autoscaler.py:588 -- StandardAutoscaler: Launching 2 new nodes
2019-02-18 12:25:34,798 INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (2 pending) (bringup=True)
2019-02-18 12:25:34,798 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:39,122 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:39,122 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=4 Mean=4 Max=4, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:43,406 INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_hbfty_8z: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,613 INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_lclafqgd: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614 INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614 INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:45,373 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:45,373 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=10 Mean=10 Max=10, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:46,386 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:46,481 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:50,393 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:50,393 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=15 Mean=15 Max=15, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:55,470 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:55,470 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=20 Mean=20 Max=20, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:57,619 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:57,657 INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:58,199 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got IP [LogTimer=12396ms]
2019-02-18 12:25:58,210 INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Waiting for SSH...
2019-02-18 12:25:58,222 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got IP [LogTimer=12404ms]
2019-02-18 12:25:58,251 INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Waiting for SSH...
Monitor: could not find ip for client 9314824bc109cd85b897aacc8d6b917b2bcc80d2
Monitor: could not find ip for client 499b4c9a4193e9825cc616cf2051285f576b6dc0
~~ snip: the two “could not find ip” lines above repeat many times, alternating between the two client IDs ~~
2019-02-18 12:26:00,507 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:00,507 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=1 Mean=9 Max=25, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:05,560 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:05,560 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=6 Mean=14 Max=30, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:10,580 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:10,580 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=11 Mean=19 Max=35, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:15,686 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:15,687 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=16 Mean=24 Max=40, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:20,706 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:20,706 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=21 Mean=29 Max=45, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
~~ snip ~~
2019-02-18 12:30:51,015 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got SSH [LogTimer=292805ms]
2019-02-18 12:30:51,015 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306401ms]
2019-02-18 12:30:51,015 ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Error updating Unable to SSH to node
2019-02-18 12:30:51,086 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got SSH [LogTimer=292835ms]
2019-02-18 12:30:51,086 INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306472ms]
2019-02-18 12:30:51,086 ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Error updating Unable to SSH to node
Exception in thread Thread-48:
Traceback (most recent call last):
File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
raise e
File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
self.do_update()
File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Exception in thread Thread-47:
(identical traceback to Thread-48 above, ending in AssertionError: Unable to SSH to node)
2019-02-18 12:30:54,114 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:54,114 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=295 Mean=303 Max=319, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=3 Max=5
2019-02-18 12:30:56,944 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:56,944 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=297 Mean=305 Max=321, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:03,112 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:03,112 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=304 Mean=312 Max=328, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:09,433 INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:09,434 INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=310 Mean=318 Max=334, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
~~ snip ~~
redis-shard_0.out
16540:C 18 Feb 2019 12:25:32.248 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16540:C 18 Feb 2019 12:25:32.248 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16540, just started
16540:C 18 Feb 2019 12:25:32.248 # Configuration loaded
16540:M 18 Feb 2019 12:25:32.249 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16540:M 18 Feb 2019 12:25:32.249 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16540:M 18 Feb 2019 12:25:32.249 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16540:M 18 Feb 2019 12:25:32.249 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16540:M 18 Feb 2019 12:25:32.249 # Server initialized
16540:M 18 Feb 2019 12:25:32.249 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis.out
16536:C 18 Feb 2019 12:25:32.135 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16536:C 18 Feb 2019 12:25:32.135 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16536, just started
16536:C 18 Feb 2019 12:25:32.135 # Configuration loaded
16536:M 18 Feb 2019 12:25:32.136 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16536:M 18 Feb 2019 12:25:32.136 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16536:M 18 Feb 2019 12:25:32.136 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16536:M 18 Feb 2019 12:25:32.136 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16536:M 18 Feb 2019 12:25:32.136 # Server initialized
16536:M 18 Feb 2019 12:25:32.136 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
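As an aside to the Redis warnings above: the open-file soft limit can be checked from Python inside the job script before launching the head node. A minimal sketch; the 10032 figure is the one Redis itself asks for with `maxclients` of 10000.

```python
# Check the open-file limit that Redis complained about in the logs above.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")
if soft < 10032:  # what Redis requested for maxclients=10000
    print("soft limit is below what Redis wants; raising it in the job "
          "script (e.g. `ulimit -n 10032`) may help, if the hard limit allows")
```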
Issue Analytics
- Created 5 years ago
- Comments: 15 (14 by maintainers)
Top GitHub Comments
Not sure if this is the issue, but there are multiple ports that need to be open between the different machines. For example, by default there are two Redis servers (one “primary” and one or more “non-primary” shards). The “object manager” and “node manager” on each machine also need their own ports. These ports can be specified in the `ray start` command with flags like `--redis-port=6379`, `--redis-shard-ports=6380`, `--node-manager-port=12345`, and `--object-manager-port=12346`. Note that the Redis ones are only relevant on the head node.
For `ray.init()`, only the “primary” Redis address is relevant; that’s the first one that is printed.
When you get the error message `Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node`, can you check the logs on the same machine in `/tmp/ray/session*/` (just the most recent session) and see if anything looks like it crashed?
This was caused by trying to run a driver from a computer that was not running a Ray worker. It can be solved by simply starting a “dummy” worker on that computer:
ray start --num-cpus 0 --redis-address yourheadnode:port
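Following up on the ports suggestion: a quick TCP reachability check can be run from a worker node. A sketch only; `yourheadnode` is a placeholder, and 31236/48029 are the Redis ports that appear in the logs in this issue.

```python
# Sketch: check from a worker node whether the head node's ports are
# reachable. "yourheadnode" is a placeholder hostname.
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            return s.connect_ex((host, port)) == 0
    except OSError:  # e.g. DNS resolution failure
        return False


head = "yourheadnode"  # placeholder for the head node's hostname or IP
for port in (31236, 48029):  # primary Redis and shard ports from the logs
    print(port, "open" if port_open(head, port) else "closed/unreachable")
```

If the primary Redis port is reachable but the shard or manager ports are not, that would explain workers connecting partially and then stalling.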