terminated idle nodes generate a misleading warning
Start an AWS cluster with 0 workers (m5.large, default yaml, idle timeout set to 1), then run this code:
import time

import ray

ray.init(address="auto")

@ray.remote(num_cpus=2)
def f():
    time.sleep(60)

# Call f.remote() so that two tasks are actually submitted.
ray.get([f.remote() for _ in range(2)])
This results in 1 worker being spun up. When this worker becomes idle, the autoscaler terminates it, but generates the following warning:
>>> 2021-02-16 04:52:40,168 WARNING worker.py:1034 -- The node with node id 610641aa77c977618fefc691d433d4606876e904 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).
Issue Analytics
- State:
- Created 3 years ago
- Comments:8 (8 by maintainers)
Hmm, I need to double check, but ideally this should be prevented (that seems to be the more natural behavior, and ray stop --force should print this msg).
Desired behavior is that this warning is prevented when the node is stopped with ray stop, or with SIGTERM sent to the ray start --block process.