[autoscaler] wrongly shuts down all nodes due to one bad node.


The autoscaler tried to take down one node it considered idle (a false positive: the node was running at 100% CPU) but ended up killing every node because of an internal KeyError. It seems to get confused about the node mapping. This is a serious issue, as all my progress was lost. I was using 16 placement groups (one machine each).

2021-02-17 14:55:34,817 INFO monitor.py:207 -- :event_summary:Removing 1 nodes of type cpu_48_spot (idle).
2021-02-17 14:55:34,817 INFO monitor.py:207 -- :event_summary:Adding 1 nodes of type cpu_48_spot.
2021-02-17 14:55:40,430 INFO load_metrics.py:102 -- LoadMetrics: Removed mapping: 172.31.23.116 - 1613573430.7000167
2021-02-17 14:55:40,430 INFO load_metrics.py:109 -- LoadMetrics: Removed 1 stale ip mappings: {'172.31.23.116'} not in {'172.31.16.240', '172.31.27.173', '172.31.26.163', '172.31.20.177', '172.31.25.79', '172.31.28.159', '172.31.21.227', '172.31.24.131', '172.31.31.164', '172.31.22.24', '172.31.26.41', '172.31.19.126', '172.31.22.66', '172.31.26.13', '172.31.30.105', '172.31.25.157', '172.31.27.26'}
2021-02-17 14:55:40,744 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:40,744 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:46,909 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:46,909 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:47,082 INFO monitor.py:207 -- :event_summary:Resized to 724 CPUs.
2021-02-17 14:55:52,997 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:52,998 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:55:58,965 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:58,965 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:05,002 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:05,003 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:10,999 ERROR autoscaler.py:266 -- StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:11,000 ERROR autoscaler.py:139 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,001 CRITICAL autoscaler.py:152 -- StandardAutoscaler: Too many errors, abort.
2021-02-17 14:56:11,001 ERROR monitor.py:271 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 269, in run
    self._run()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py", line 202, in _run
    self.autoscaler.update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
    raise e
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 137, in update
    self._update()
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 270, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 598, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 170, in node_tags
    d1 = self.tag_cache[node_id]
KeyError: 'i-02b77234ffad2072c'
2021-02-17 14:56:11,002 ERROR autoscaler.py:724 -- StandardAutoscaler: kill_workers triggered
2021-02-17 14:56:11,453 ERROR autoscaler.py:729 -- StandardAutoscaler: terminated 16 node(s)
2021-02-17 14:56:11,453 INFO monitor.py:250 -- Monitor: Exception caught. Taking down workers...
2021-02-17 14:56:11,680 INFO monitor.py:262 -- Monitor: Workers taken down.
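
The cascade in the log above (the same error logged six times, then "Too many errors, abort", then every worker terminated) can be illustrated with a toy model. This is only a sketch of the pattern the log suggests, not Ray's actual code: the class `ToyAutoscaler`, the function `run_monitor`, and the threshold of 5 tolerated failures are assumptions chosen so that the sixth failure aborts, matching the six tracebacks.

```python
class ToyAutoscaler:
    """Toy model of the failure pattern in the log; NOT Ray's implementation."""

    def __init__(self, worker_ids, max_failures=5):
        self.workers = set(worker_ids)
        self.max_failures = max_failures  # assumed threshold, for illustration only
        self.num_failures = 0

    def _update(self):
        # Stand-in for the real update step: the same bad node fails every round.
        raise KeyError("i-02b77234ffad2072c")

    def update(self):
        try:
            self._update()
        except Exception:
            self.num_failures += 1
            if self.num_failures > self.max_failures:
                # Corresponds to "Too many errors, abort." in the log.
                raise
            # Otherwise the error is swallowed and the next round retries.

    def kill_workers(self):
        # Terminates every worker, healthy or not, and reports how many.
        terminated = len(self.workers)
        self.workers.clear()
        return terminated


def run_monitor(autoscaler, max_rounds=100):
    """Run update rounds; on an escaped exception, take down all workers."""
    try:
        for _ in range(max_rounds):
            autoscaler.update()
    except Exception:
        return autoscaler.kill_workers()
    return 0


autoscaler = ToyAutoscaler([f"worker-{i}" for i in range(16)])
terminated = run_monitor(autoscaler)
print(f"failures={autoscaler.num_failures}, terminated={terminated}")
```

In this model, one node that fails the same way on every round is enough to exhaust the error budget and terminate all 16 healthy workers, which is the behavior reported here.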

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 19 (13 by maintainers)

Top GitHub Comments

2 reactions
wuisawesome commented, Feb 23, 2021

@jennicetao thanks for the report. What does your workload look like? Do you have unused placement groups in the cluster?

As a short-term fix to unblock yourself, can you set idle_timeout_minutes: 999999 in your cluster config for now?

0 reactions
ericl commented, Mar 1, 2021

Can we fix that first?

Eric

On Mon, Mar 1, 2021 at 11:31 AM Alex Wu notifications@github.com wrote:

Re (1), it looks like the AWS node provider removes the node from its cache when terminating a node. The KeyError looks like a side effect of the node not being terminated properly.
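
The mechanism described in this comment can be sketched with a toy provider. It is a minimal illustration under stated assumptions, not Ray's AWS node provider: the class `ToyNodeProvider`, the method `node_tags_or_default`, and the tag key `node-type` are made up for the example. Only the `tag_cache[node_id]` lookup mirrors the line shown in the traceback.

```python
class ToyNodeProvider:
    """Toy stand-in for a node provider with a tag cache; NOT Ray's AWS provider."""

    def __init__(self):
        self.tag_cache = {}

    def create_node(self, node_id, tags):
        self.tag_cache[node_id] = dict(tags)

    def terminate_node(self, node_id):
        # Evict the cache entry as part of termination, as the comment describes.
        self.tag_cache.pop(node_id, None)

    def node_tags(self, node_id):
        # Plain indexing, as in the traceback: raises KeyError after eviction.
        return self.tag_cache[node_id]

    def node_tags_or_default(self, node_id):
        # Hypothetical defensive variant: an evicted node yields empty tags.
        return self.tag_cache.get(node_id, {})


provider = ToyNodeProvider()
provider.create_node("node-a", {"node-type": "cpu_48_spot"})
provider.terminate_node("node-a")

# Any code that still holds the node id and asks for its tags now fails.
try:
    provider.node_tags("node-a")
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed, provider.node_tags_or_default("node-a"))
```

A defensive lookup like this would only hide the symptom. As the comment notes, the underlying problem is that the node is not being terminated properly, so a caller still refers to an id the cache has already dropped.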
