[autoscaler] New nodes fail to connect after a node gets stopped/removed

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): source @ 0eba127c8ceeefe65dbd849eac9c7b621a7a0343
  • Ray version: 0.7.0.dev0 @ 0eba127c8ceeefe65dbd849eac9c7b621a7a0343
  • Python version: 3.6
  • Exact command to reproduce: n/a

Describe the problem

When I start the autoscaler with min_workers=0 and max_workers=50 and kill a worker node while the cluster is scaling up (say, after the 15th node has started), no new nodes are ever able to connect to the cluster after that. The cluster keeps running normally with the 14 existing nodes, but it no longer scales up. This was run on GCP.

Source code / logs

Here’s a snippet from the logs of a new worker that’s trying to connect after another node has been stopped:

I0223 02:08:56.371333  2293 store.cc:989] Allowing the Plasma store to use up to 3.00686GB of memory.
I0223 02:08:56.372319  2293 store.cc:1016] Starting object store with directory /dev/shm and huge page support disabled
I0223 02:11:08.773119  2293 store.cc:594] Disconnecting client on fd 10
I0223 02:11:08.773244  2293 store.cc:594] Disconnecting client on fd 13
I0223 02:11:08.773310  2293 store.cc:594] Disconnecting client on fd 14
I0223 02:11:08.773636  2293 store.cc:594] Disconnecting client on fd 16
I0223 02:11:08.785806  2293 store.cc:594] Disconnecting client on fd 17
I0223 02:11:08.785851  2293 store.cc:594] Disconnecting client on fd 18
I0223 02:11:08.785866  2293 store.cc:594] Disconnecting client on fd 15
I0223 02:11:08.785878  2293 store.cc:594] Disconnecting client on fd 12
I0223 02:11:08.785892  2293 store.cc:594] Disconnecting client on fd 11
I0223 02:11:08.898232  2293 store.cc:594] Disconnecting client on fd 9
I0223 02:11:08.898277  2293 store.cc:594] Disconnecting client on fd 7
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0223 02:08:56.585722  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 250147186665cb1509d0d6ab0ea5563240625bfb at 10.138.0.12:43107
I0223 02:08:56.585978  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client e0a591fff8acabd5e15f6a20ca8ef44f71c7f4f5 at 10.138.0.16:44127
I0223 02:08:56.586580  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client b89548b9100e23a8c4de47facde430bbd2ee10f4 at 10.138.0.14:40591
I0223 02:08:56.586761  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 45edd214253f79b47c1a134b8b81d9ab3cd1313b at 10.138.0.17:38925
I0223 02:08:56.586904  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 80012496ea9594c86e2a758cfa54fd85d277b885 at 10.138.0.20:33273
I0223 02:08:56.587082  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 217c8262cac25944175c386c9b89340c8b6cd5e8 at 10.138.0.18:35129
W0223 02:11:07.577844  2294 node_manager.cc:349] Failed to connect to client 217c8262cac25944175c386c9b89340c8b6cd5e8 in ClientAdded. TcpConnect returned status: IOError: Connection timed out. This may be caused by trying to connect to a node manager that has failed.
I0223 02:11:07.577900  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 1ab0c81e7588f26e10bc756626844fd4166bae54 at 10.138.0.119:45825
I0223 02:11:07.579109  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client c69fb1d54002be754119d60ec465ccf466ffa411 at 10.138.0.23:33359
I0223 02:11:07.580124  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 7fb238176680ee8791c38f59f4bcb84144f4e023 at 10.138.0.121:33489
I0223 02:11:07.580852  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client ad6a5ab33f5a52bf717627bd1dbc2a7402d85752 at 10.138.0.22:43065
I0223 02:11:07.582064  2294 node_manager.cc:365] [ConnectClient] Trying to connect to client 3260fd1bd2ed626b9225e4a43af1ec938e4f5773 at 10.138.0.120:41951
W0223 02:11:07.583163  2294 node_manager.cc:414] Received ClientRemoved callback for an unknown client 217c8262cac25944175c386c9b89340c8b6cd5e8.
W0223 02:11:07.583593  2294 node_manager.cc:245] Last heartbeat was sent 130998 ms ago
F0223 02:11:07.592795  2294 node_manager.cc:394]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1550887867 (unix time) try "date -d @1550887867" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e8000008f6) received by PID 2294 (TID 0x7f0e4e18e740) from PID 2294; stack trace: ***
@     0x7f0e4dd70390 (unknown)
@     0x7f0e4d129428 gsignal
@     0x7f0e4d12b02a abort
@           0x5ffa96 google::logging_fail()
@           0x5ffac0 google::LogMessage::Fail()
@           0x5ffa04 google::LogMessage::SendToLog()
@           0x5ff346 google::LogMessage::Flush()
@           0x5ff141 google::LogMessage::~LogMessage()
@           0x525270 ray::RayLog::~RayLog()
@           0x57372b ray::raylet::NodeManager::ClientRemoved()
@           0x4e0af2 ray::gcs::ClientTable::HandleNotification()
@           0x4e13ab _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8UniqueIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_S6_SG_E_clES3_S6_SG_EUlS3_S6_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
@           0x4f4ea4 _ZZN3ray3gcs3LogINS_8UniqueIDE15ClientTableDataE9SubscribeERKS2_S6_RKSt8functionIFvPNS0_14AsyncGcsClientES6_RKSt6vectorI16ClientTableDataTSaISB_EEEERKS7_IFvS9_EEENKUlRKSsE_clESP_
@           0x520629 (anonymous namespace)::ProcessCallback()
@           0x5215a9 ray::gcs::SubscribeRedisCallback()
@           0x58706d redisProcessCallbacks
@           0x52495d RedisAsioClient::handle_read()
@           0x524df5 boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@           0x4c3cd5 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@           0x4c6f12 boost::asio::detail::scheduler::run()
@           0x4ba94a main
@     0x7f0e4d114830 __libc_start_main
@           0x4c0379 _start
@                0x0 (unknown)
Ray worker pid: 2304
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 111, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/ray/python/ray/worker.py", line 1003, in main_loop
task = self._get_next_task_from_local_scheduler()
File "/home/ubuntu/ray/python/ray/worker.py", line 986, in _get_next_task_from_local_scheduler
task = self.raylet_client.get_task()
File "_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 118, in <module>
driver_id=None)
File "/home/ubuntu/ray/python/ray/utils.py", line 67, in push_error_to_driver
time.time())
File "_raylet.pyx", line 294, in ray._raylet.RayletClient.push_error
File "_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Connection closed unexpectedly.
Ray worker pid: 2305
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 111, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/ray/python/ray/worker.py", line 1003, in main_loop
task = self._get_next_task_from_local_scheduler()
File "/home/ubuntu/ray/python/ray/worker.py", line 986, in _get_next_task_from_local_scheduler
task = self.raylet_client.get_task()
File "_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.

During handling of the above exception, another exception occurred:

Here’s a link to the /tmp/ray from the newly created worker that’s trying to connect to the cluster: https://slack-files.com/T25RKHCD6-FGG5B0MFZ-1508f638f8
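
The timestamps in the log above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original report: it parses the glog-style time prefix of two lines copied from the log (the last ConnectClient attempt before the stall and the "Failed to connect" warning) and compares the gap with the "Last heartbeat was sent 130998 ms ago" warning. The helper name parse_glog_time is made up for this example, and both lines are assumed to be from the same day.

```python
from datetime import datetime


def parse_glog_time(prefix: str) -> datetime:
    """Parse the HH:MM:SS.ffffff field of a glog-style line prefix.

    Hypothetical helper for this example; the date field is ignored
    because both lines are assumed to come from the same day.
    """
    time_field = prefix.split()[1]
    return datetime.strptime(time_field, "%H:%M:%S.%f")


# Prefixes copied from the log above (message text omitted).
last_connect_attempt = "I0223 02:08:56.587082  2294 node_manager.cc:365]"
connect_timeout_warning = "W0223 02:11:07.577844  2294 node_manager.cc:349]"

# Value reported by the "Last heartbeat was sent ... ms ago" warning.
reported_heartbeat_gap_ms = 130998

gap = parse_glog_time(connect_timeout_warning) - parse_glog_time(last_connect_attempt)
gap_ms = gap.total_seconds() * 1000

print(f"connect attempt to timeout warning: {gap_ms:.0f} ms")
print(f"reported heartbeat gap:             {reported_heartbeat_gap_ms} ms")
```

The two values differ by only a few milliseconds. That is consistent with, though it does not prove, the node manager having sent no heartbeats while it waited for the connection to the stopped node to time out, after which the monitor marked it dead.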

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Feb 24, 2019

That’s unrelated; the timeout here is internal to the C++ backend.

0 reactions
guoyuhong commented, Feb 25, 2019

I reviewed the code in ClientTable. I think the logic itself could be right, but the order of events is not what the code expects, because one callback is invoked from inside another. In Raylet::RegisterGcs, client_table().Connect is called first, and node_manager_.RegisterGcs(), which contains client_table().RegisterClientAddedCallback, is called later. If the callback function notification_callback in client_table().Connect finishes before client_table().RegisterClientAddedCallback is called, the logic is correct. However, notification_callback is called from within the callback function of Append. That is to say, client_table().RegisterClientAddedCallback could be called ahead of notification_callback. In that case, HandleNotification will call client_added_callback_ and unexpectedly try to connect to the dead node.
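
The ordering problem described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ray's C++ code: ToyClientTable, its methods, and the node names are hypothetical, the deque stands in for the event loop, and the model assumes that registering the client-added callback replays only cached clients that are still alive.

```python
from collections import deque


class ToyClientTable:
    """Toy stand-in for a client table; NOT Ray's implementation."""

    def __init__(self, log):
        self.log = log            # historical (client_id, is_insertion) entries
        self.pending = deque()    # deferred callbacks, standing in for the event loop
        self.cache = {}           # client_id -> currently alive?
        self.client_added_callback = None

    def connect(self):
        # The notification callback is only queued here, mirroring a
        # callback that fires later from inside another callback.
        self.pending.append(self._notification_callback)

    def register_client_added_callback(self, callback):
        self.client_added_callback = callback
        # Assumption: only clients already known to be alive are replayed.
        for client_id, alive in self.cache.items():
            if alive:
                callback(client_id)

    def run_pending(self):
        while self.pending:
            self.pending.popleft()()

    def _notification_callback(self):
        for client_id, is_insertion in self.log:
            self._handle_notification(client_id, is_insertion)

    def _handle_notification(self, client_id, is_insertion):
        self.cache[client_id] = is_insertion
        if is_insertion and self.client_added_callback is not None:
            self.client_added_callback(client_id)


# node-b joined and was later removed, so it is dead when the new node starts.
LOG = [("node-a", True), ("node-b", True), ("node-b", False)]


def attempts_when(register_first):
    """Return the connection attempts made under the given ordering."""
    attempts = []
    table = ToyClientTable(LOG)
    table.connect()
    if register_first:
        table.register_client_added_callback(attempts.append)
        table.run_pending()
    else:
        table.run_pending()
        table.register_client_added_callback(attempts.append)
    return attempts


print(attempts_when(register_first=False))  # notification callback runs first
print(attempts_when(register_first=True))   # registration runs first
```

In this model, when the queued notification callback runs before registration, only node-a is contacted. When registration happens first, the replayed history triggers a connection attempt to node-b even though a later entry marks it as removed, which matches the unexpected connection to the dead node described above.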
