[autoscaler] New nodes fail to connect after a node gets stopped/removed
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): source @ 0eba127c8ceeefe65dbd849eac9c7b621a7a0343
- Ray version: 0.7.0.dev0 @ 0eba127c8ceeefe65dbd849eac9c7b621a7a0343
- Python version: 3.6
- Exact command to reproduce: n/a
Describe the problem
When I start the autoscaler with min_workers=0 and max_workers=50, and kill a worker node during scale-up (say, after the 15th node has started), no new nodes are ever able to connect to the cluster after that. The cluster keeps running normally with the 14 existing nodes, but does not scale up anymore. This is run on GCP. A sketch of the kind of workload used to drive the scale-up is shown below.
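Since the exact reproduce command is n/a above, here is a minimal sketch of a driver workload that pushes the autoscaler toward max_workers; the Redis address and task body are placeholders, not taken from the original report, and the worker VM is then stopped manually from the GCP console while scale-up is in progress.

import time

import ray

# Connect to the running cluster (Ray 0.7.0.dev0-era API); address is a placeholder.
ray.init(redis_address="10.138.0.2:6379")


@ray.remote(num_cpus=1)
def hold(i):
    # Keep a CPU busy long enough for the autoscaler to keep launching workers.
    time.sleep(600)
    return i


# Queue far more tasks than the current cluster can run; while new worker VMs
# are still coming up, stop one of them from the GCP console.
results = ray.get([hold.remote(i) for i in range(200)])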
Source code / logs
Here’s a snippet from the logs of a new worker that is trying to connect after another node has been stopped:
I0223 02:08:56.371333 2293 store.cc:989] Allowing the Plasma store to use up to 3.00686GB of memory.
I0223 02:08:56.372319 2293 store.cc:1016] Starting object store with directory /dev/shm and huge page support disabled
I0223 02:11:08.773119 2293 store.cc:594] Disconnecting client on fd 10
I0223 02:11:08.773244 2293 store.cc:594] Disconnecting client on fd 13
I0223 02:11:08.773310 2293 store.cc:594] Disconnecting client on fd 14
I0223 02:11:08.773636 2293 store.cc:594] Disconnecting client on fd 16
I0223 02:11:08.785806 2293 store.cc:594] Disconnecting client on fd 17
I0223 02:11:08.785851 2293 store.cc:594] Disconnecting client on fd 18
I0223 02:11:08.785866 2293 store.cc:594] Disconnecting client on fd 15
I0223 02:11:08.785878 2293 store.cc:594] Disconnecting client on fd 12
I0223 02:11:08.785892 2293 store.cc:594] Disconnecting client on fd 11
I0223 02:11:08.898232 2293 store.cc:594] Disconnecting client on fd 9
I0223 02:11:08.898277 2293 store.cc:594] Disconnecting client on fd 7
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0223 02:08:56.585722 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 250147186665cb1509d0d6ab0ea5563240625bfb at 10.138.0.12:43107
I0223 02:08:56.585978 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client e0a591fff8acabd5e15f6a20ca8ef44f71c7f4f5 at 10.138.0.16:44127
I0223 02:08:56.586580 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client b89548b9100e23a8c4de47facde430bbd2ee10f4 at 10.138.0.14:40591
I0223 02:08:56.586761 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 45edd214253f79b47c1a134b8b81d9ab3cd1313b at 10.138.0.17:38925
I0223 02:08:56.586904 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 80012496ea9594c86e2a758cfa54fd85d277b885 at 10.138.0.20:33273
I0223 02:08:56.587082 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 217c8262cac25944175c386c9b89340c8b6cd5e8 at 10.138.0.18:35129
W0223 02:11:07.577844 2294 node_manager.cc:349] Failed to connect to client 217c8262cac25944175c386c9b89340c8b6cd5e8 in ClientAdded. TcpConnect returned status: IOError: Connection timed out. This may be caused by trying to connect to a node manager that has failed.
I0223 02:11:07.577900 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 1ab0c81e7588f26e10bc756626844fd4166bae54 at 10.138.0.119:45825
I0223 02:11:07.579109 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client c69fb1d54002be754119d60ec465ccf466ffa411 at 10.138.0.23:33359
I0223 02:11:07.580124 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 7fb238176680ee8791c38f59f4bcb84144f4e023 at 10.138.0.121:33489
I0223 02:11:07.580852 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client ad6a5ab33f5a52bf717627bd1dbc2a7402d85752 at 10.138.0.22:43065
I0223 02:11:07.582064 2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 3260fd1bd2ed626b9225e4a43af1ec938e4f5773 at 10.138.0.120:41951
W0223 02:11:07.583163 2294 node_manager.cc:414] Received ClientRemoved callback for an unknown client 217c8262cac25944175c386c9b89340c8b6cd5e8.
W0223 02:11:07.583593 2294 node_manager.cc:245] Last heartbeat was sent 130998 ms ago
F0223 02:11:07.592795 2294 node_manager.cc:394] Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1550887867 (unix time) try "date -d @1550887867" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e8000008f6) received by PID 2294 (TID 0x7f0e4e18e740) from PID 2294; stack trace: ***
@ 0x7f0e4dd70390 (unknown)
@ 0x7f0e4d129428 gsignal
@ 0x7f0e4d12b02a abort
@ 0x5ffa96 google::logging_fail()
@ 0x5ffac0 google::LogMessage::Fail()
@ 0x5ffa04 google::LogMessage::SendToLog()
@ 0x5ff346 google::LogMessage::Flush()
@ 0x5ff141 google::LogMessage::~LogMessage()
@ 0x525270 ray::RayLog::~RayLog()
@ 0x57372b ray::raylet::NodeManager::ClientRemoved()
@ 0x4e0af2 ray::gcs::ClientTable::HandleNotification()
@ 0x4e13ab _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8UniqueIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_S6_SG_E_clES3_S6_SG_EUlS3_S6_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
@ 0x4f4ea4 _ZZN3ray3gcs3LogINS_8UniqueIDE15ClientTableDataE9SubscribeERKS2_S6_RKSt8functionIFvPNS0_14AsyncGcsClientES6_RKSt6vectorI16ClientTableDataTSaISB_EEEERKS7_IFvS9_EEENKUlRKSsE_clESP_
@ 0x520629 (anonymous namespace)::ProcessCallback()
@ 0x5215a9 ray::gcs::SubscribeRedisCallback()
@ 0x58706d redisProcessCallbacks
@ 0x52495d RedisAsioClient::handle_read()
@ 0x524df5 boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@ 0x4c3cd5 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@ 0x4c6f12 boost::asio::detail::scheduler::run()
@ 0x4ba94a main
@ 0x7f0e4d114830 __libc_start_main
@ 0x4c0379 _start
@ 0x0 (unknown)
Ray worker pid: 2304
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 111, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/ray/python/ray/worker.py", line 1003, in main_loop
task = self._get_next_task_from_local_scheduler()
File "/home/ubuntu/ray/python/ray/worker.py", line 986, in _get_next_task_from_local_scheduler
task = self.raylet_client.get_task()
File "_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 118, in <module>
driver_id=None)
File "/home/ubuntu/ray/python/ray/utils.py", line 67, in push_error_to_driver
time.time())
File "_raylet.pyx", line 294, in ray._raylet.RayletClient.push_error
File "_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Connection closed unexpectedly.
Ray worker pid: 2305
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 111, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/ray/python/ray/worker.py", line 1003, in main_loop
task = self._get_next_task_from_local_scheduler()
File "/home/ubuntu/ray/python/ray/worker.py", line 986, in _get_next_task_from_local_scheduler
task = self.raylet_client.get_task()
File "_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.
During handling of the above exception, another exception occurred:
Here’s a link to the /tmp/ray directory from the newly created worker that’s trying to connect to the cluster: https://slack-files.com/T25RKHCD6-FGG5B0MFZ-1508f638f8
Top GitHub Comments
That’s unrelated; the timeout here is internal to the C++ backend.
I reviewed the code in ClientTable. I think the logic could be right, but the time sequence is not as expected because there is a callback inside another callback. In Raylet::RegisterGcs, client_table().Connect is called first, and node_manager_.RegisterGcs(), which contains client_table().RegisterClientAddedCallback, is called later. If the notification_callback passed to client_table().Connect finished before client_table().RegisterClientAddedCallback ran, the logic would be correct. However, notification_callback is invoked inside the callback of Append, which means client_table().RegisterClientAddedCallback can be called ahead of notification_callback. In that case HandleNotification will call client_added_callback_ and try to connect to the dead node unexpectedly.