[autoscaler] wrongly shuts down all nodes due to one bad node.
The autoscaler tried to take down one node it considered idle (a false positive: the node was running at 100% CPU), but ended up killing every node because of an internal KeyError. It seems to get confused about the node mapping. This is a serious issue, as all my progress was lost. I was using 16 placement groups (one machine each).
2021-02-17 14:55:34,817 INFO monitor.py:207 -- :event_summary:Removing 1 nodes of type cpu_48_spot (idle).
2021-02-17 14:55:34,817 INFO monitor.py:207 -- :event_summary:Adding 1 nodes of type cpu_48_spot.
2021-02-17 14:55:40,430 INFO load_metrics.py:102 -- LoadMetrics: Removed mapping: 172.31.23.116 - 1613573430.7000167
2021-02-17 14:55:40,430 INFO load_metrics.py:109 -- LoadMetrics: Removed 1 stale ip mappings: {'172.31.23.116'} not in {'172.31.16.240', '172.31.27.173', '172.31.26.163', '172.31.20.177', '172.31.25.79', '172.31.28.159', '172.31.21.227', '172.31.24.131', '172.31.31.164', '172.31.22.24', '172.31.26.41', '172.31.19.126', '172.31.22.66', '172.31.26.13', '172.31.30.105', '172.31.25.157', '172.31.27.26'}
2021-02-17 14:55:40,744 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:40,744 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:46,909 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:46,909 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:47,082 INFO monitor.py:207 -- :event_summary:Resized to 724 CPUs.
2021-02-17 14:55:52,997 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:52,998 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:58,965 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:58,965 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:05,002 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:05,003 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:10,999 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:11,000 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,001 CRITICAL autoscaler.py:152 -- StandardAutoscaler: Too many errors, abort.
2021-02-17 14:56:11,001 ERROR monitor.py:271 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 269, in run
    self._run()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 202, in _run
    self.autoscaler.update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
    raise e
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,002 ERROR autoscaler.py:724 -- StandardAutoscaler: kill_workers triggered
2021-02-17 14:56:11,453 ERROR autoscaler.py:729 -- StandardAutoscaler: terminated 16 node(s)
2021-02-17 14:56:11,453 INFO monitor.py:250 -- Monitor: Exception caught. Taking down workers...
2021-02-17 14:56:11,680 INFO monitor.py:262 -- Monitor: Workers taken down.
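
The failure pattern in the log above (the same KeyError on every update, then "Too many errors, abort" and all workers killed) can be reproduced with a toy model. This is only a sketch, not Ray's actual code: `ToyProvider`, `run_updates`, the `"node-type"` tag key, and the threshold of 5 failures are all assumptions made for illustration. It contrasts a strict cache lookup, which raises on a node that is missing from the cache, with a tolerant lookup that falls back to an empty tag dict.

```python
from typing import Callable, Dict

# Assumed threshold for illustration; not taken from Ray's source.
MAX_FAILURES = 5


class ToyProvider:
    """Hypothetical stand-in for a node provider with a tag cache."""

    def __init__(self, tags: Dict[str, Dict[str, str]]) -> None:
        self.tag_cache = dict(tags)

    def node_tags_strict(self, node_id: str) -> Dict[str, str]:
        # Raises KeyError when the node is missing from the cache,
        # mirroring the lookup shown in the traceback.
        return self.tag_cache[node_id]

    def node_tags_safe(self, node_id: str) -> Dict[str, str]:
        # Tolerant variant: a missing node yields an empty tag dict.
        return self.tag_cache.get(node_id, {})


def run_updates(
    lookup: Callable[[str], Dict[str, str]],
    node_id: str,
    rounds: int = 10,
    max_failures: int = MAX_FAILURES,
) -> str:
    """Run repeated update rounds; abort once failures exceed the limit."""
    failures = 0
    for _ in range(rounds):
        try:
            tags = lookup(node_id)
            _node_type = tags.get("node-type", "unknown")
        except KeyError:
            failures += 1
            if failures > max_failures:
                # In the report, this is the point where every worker
                # was taken down.
                return "aborted"
    return "ok"


# The bad node is absent from the cache, as in the report.
provider = ToyProvider({"i-healthy": {"node-type": "cpu_48_spot"}})
strict_result = run_updates(provider.node_tags_strict, "i-02b77234ffad2072c")
safe_result = run_updates(provider.node_tags_safe, "i-02b77234ffad2072c")
print(strict_result, safe_result)
```

With the strict lookup, a single node missing from the cache fails every round until the loop gives up. The tolerant lookup lets the loop continue, so one bad node does not escalate into a cluster-wide shutdown.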
@jennicetao thanks for the report. What does your workload look like? Do you have unused placement groups in the cluster?
As a short-term fix to unblock yourself, can you set
idle_timeout_minutes: 999999
in your cluster config for now?

Can we fix that first?
Eric
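
For reference, the `idle_timeout_minutes` workaround suggested above amounts to overriding one top-level key in the cluster config. A minimal Python sketch, assuming the config has already been loaded into a plain dict; the helper name `with_idle_timeout` and the sample config values are hypothetical:

```python
import copy
from typing import Any, Dict


def with_idle_timeout(config: Dict[str, Any], minutes: int = 999999) -> Dict[str, Any]:
    """Return a copy of the config with idle_timeout_minutes overridden."""
    patched = copy.deepcopy(config)  # leave the caller's config untouched
    patched["idle_timeout_minutes"] = minutes
    return patched


# Hypothetical sample config; only idle_timeout_minutes matters here.
cluster_config = {
    "cluster_name": "example",
    "max_workers": 16,
    "idle_timeout_minutes": 5,
}
patched_config = with_idle_timeout(cluster_config)
print(patched_config["idle_timeout_minutes"])
```

A very large timeout effectively stops the autoscaler from marking nodes as idle, which sidesteps the false-positive idle removal but does not address the KeyError itself.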