terminated idle nodes generate a misleading warning

Start an AWS cluster with 0 workers (m5.large, default YAML, idle timeout set to 1), then run this code:

import time

import ray

# Connect to the already running cluster.
ray.init(address="auto")

@ray.remote(num_cpus=2)
def f():
    time.sleep(60)

# Launch two tasks and wait for both to finish.
ray.get([f.remote() for _ in range(2)])

This results in 1 worker being spun up. When this worker becomes idle, the autoscaler terminates it, but the following warning is generated:

>>> 2021-02-16 04:52:40,168     WARNING worker.py:1034 -- The node with node id 610641aa77c977618fefc691d433d4606876e904 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
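
One way to check whether the node named in the warning was simply terminated, rather than still expected to be running, is to look it up in the cluster's node table. The following is a minimal sketch, not part of the original report: `node_alive` is a hypothetical helper, and it assumes the node list has the shape returned by `ray.nodes()`, that is, dictionaries carrying `NodeID` and `Alive` keys.

```python
from typing import Dict, Iterable, Optional


def node_alive(nodes: Iterable[Dict], node_id: str) -> Optional[bool]:
    """Return the 'Alive' flag for node_id, or None if the node is not listed.

    `nodes` is assumed to look like the output of ray.nodes(): an iterable of
    dicts with at least the keys "NodeID" and "Alive".
    """
    for node in nodes:
        if node.get("NodeID") == node_id:
            return bool(node.get("Alive"))
    return None


# Usage on a live cluster (assumes Ray is installed and a cluster is running):
#   import ray
#   ray.init(address="auto")
#   print(node_alive(ray.nodes(), "610641aa77c977618fefc691d433d4606876e904"))
```

A result of `False` for the node id in the warning would be consistent with the worker having been shut down by the autoscaler, which is the case where the heartbeat message is misleading.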

- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).

Issue Analytics

State:
Created 3 years ago
Comments:8 (8 by maintainers)

Top GitHub Comments

rkooo567 commented, Aug 22, 2021

Hmm, I need to double check, but ideally this should prevent it (that seems to be the more natural behavior, and ray stop --force should print this message).

DmitriGekhtman commented, Sep 19, 2021

The desired behavior is that this warning should be prevented when a node is stopped with ray stop or by sending SIGTERM to the ray start --block process.
