terminated idle nodes generate a misleading warning

Start an AWS cluster with 0 workers (m5.large, default yaml, idle timeout set to 1), then run this code:

import time
import ray

ray.init(address="auto")  # connect to the running cluster

@ray.remote(num_cpus=2)
def f():
    time.sleep(60)

# Submit two tasks and wait for both to finish
ray.get([f.remote() for _ in range(2)])

This results in one worker being spun up. When this worker becomes idle, the autoscaler terminates it but generates the following warning:

>>> 2021-02-16 04:52:40,168     WARNING worker.py:1034 -- The node with node id 610641aa77c977618fefc691d433d4606876e904 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
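
Until the warning is suppressed for intentional scale-downs, one way to tell an expected termination from a real raylet crash is to pull the node id out of the warning and compare it with the nodes you know were scaled down. The following is a minimal sketch using only the standard library. The helper names and the `scaled_down` set are hypothetical (Ray does not provide them in this form), and the regular expression assumes the warning wording quoted above, which may differ between Ray versions.

```python
import re

# Pattern for the warning text quoted above; the wording may differ between Ray versions.
DEAD_NODE_RE = re.compile(
    r"The node with node id (?P<node_id>[0-9a-f]+) has been marked dead"
)


def extract_dead_node_id(log_line):
    """Return the node id from a 'marked dead' warning, or None if the line does not match."""
    match = DEAD_NODE_RE.search(log_line)
    return match.group("node_id") if match else None


def is_expected_termination(log_line, terminated_node_ids):
    """Return True if the warning refers to a node that was deliberately scaled down.

    terminated_node_ids is a hypothetical set maintained by the caller.
    """
    node_id = extract_dead_node_id(log_line)
    return node_id is not None and node_id in terminated_node_ids


warning = (
    "2021-02-16 04:52:40,168     WARNING worker.py:1034 -- The node with node id "
    "610641aa77c977618fefc691d433d4606876e904 has been marked dead because the "
    "detector has missed too many heartbeats from it."
)
# Hypothetical record of node ids that were terminated on purpose.
scaled_down = {"610641aa77c977618fefc691d433d4606876e904"}
print(is_expected_termination(warning, scaled_down))
```

A warning whose node id is not in the set would still deserve a closer look, since it may indicate a genuine crash or lagging heartbeats.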

- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

rkooo567 commented, Aug 22, 2021 (1 reaction)

Hmm, I need to double-check, but ideally this should prevent it (that seems to be the more natural behavior, and ray stop --force should print this message).

DmitriGekhtman commented, Sep 19, 2021 (0 reactions)

The desired behavior is that this should be prevented with ray stop, or with SIGTERM sent to the ray start --block process.
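
For reference, sending SIGTERM to a blocking process from Python can be sketched as below. This is only an illustration of the signal delivery itself: the child here is a stand-in that sleeps, not an actual ray start --block process, and the check on the return code assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Stand-in for a long-running blocking process (assumption: this is NOT Ray,
# just a child that sleeps so there is something to terminate).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Popen.terminate() sends SIGTERM on POSIX systems.
child.terminate()
return_code = child.wait(timeout=10)

# On POSIX, a return code of -N means the child was killed by signal N.
terminated_by_sigterm = return_code == -signal.SIGTERM
print(terminated_by_sigterm)
```

A process that wants to shut down cleanly would install its own SIGTERM handler; the stand-in above does not, so it exits through the default signal action.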
