[autoscaler/core] ResourceUsage is empty

System information

Describe the problem

Autoscaler doesn’t recognize the resource usage on the nodes when running a Tune experiment. The head node runs trials fine, reporting the following status:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0.0/0 GPUs
Memory usage on this node: 1.8/31.6 GB
Result logdir: /home/ubuntu/ray_results/gym/Ant/v3/2019-07-13T07-16-53-post-corl-sweep-1
Number of trials: 80 ({'RUNNING': 1, 'PENDING': 79})
PENDING trials:
 - id=30ae138b-seed=4520:       PENDING
 - id=30ae138c-seed=9608:       PENDING
 - id=30ae138d-seed=324:        PENDING
 - id=30ae138e-seed=7819:       PENDING
 - id=30ae138f-seed=761:        PENDING
 - id=30ae1390-seed=5996:       PENDING
 - id=30ae1391-seed=1984:       PENDING
 - id=30ae1392-seed=2190:       PENDING
 - id=30ae1393-seed=254:        PENDING
  ... 61 not shown
 - id=30ae13d1-seed=5943:       PENDING
 - id=30ae13d2-seed=7901:       PENDING
 - id=30ae13d3-seed=983:        PENDING
 - id=30ae13d4-seed=5918:       PENDING
 - id=30ae13d5-seed=7770:       PENDING
 - id=30ae13d6-seed=4246:       PENDING
 - id=30ae13d7-seed=4773:       PENDING
 - id=30ae13d8-seed=2582:       PENDING
 - id=30ae13d9-seed=7942:       PENDING
RUNNING trials:
 - id=30ae138a-seed=9072:       RUNNING, [8 CPUs, 0.0 GPUs], [pid=2848], 168 s, 5 iter, 5000 ts

The autoscaler logs, however, show no sign of resource usage:

2019-07-13 07:20:21,068 INFO autoscaler.py:479 -- Ending bringup phase
2019-07-13 07:20:26,103 INFO autoscaler.py:657 -- StandardAutoscaler: 0/0 target nodes (0 pending)
2019-07-13 07:20:26,103 INFO autoscaler.py:658 -- LoadMetrics: MostDelayedHeartbeats={'10.138.0.7': 0.2548184394836426}, NodeIdleSeconds=Min=283 Mean=283 Max=283, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
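
To spot this symptom across a long autoscaler log, a small parser can flag every LoadMetrics line whose ResourceUsage field is blank. The following is a minimal sketch that uses only the Python standard library. The function names are hypothetical, and the regular expression assumes the field order seen in the line quoted above, where ResourceUsage is immediately followed by TimeSinceLastHeartbeat; other Ray versions may format the line differently.

```python
import re

# Assumption: the field order matches the log line quoted in this issue,
# where ResourceUsage is immediately followed by TimeSinceLastHeartbeat.
_USAGE_RE = re.compile(r"ResourceUsage=(?P<usage>.*?), TimeSinceLastHeartbeat=")
_NODES_USED_RE = re.compile(r"NumNodesUsed=(?P<used>[0-9.]+)")


def parse_load_metrics(line):
    """Return (resource_usage, num_nodes_used), or None if the line does not match."""
    usage = _USAGE_RE.search(line)
    used = _NODES_USED_RE.search(line)
    if usage is None or used is None:
        return None
    return usage.group("usage").strip(), float(used.group("used"))


def usage_is_missing(line):
    """True only for a LoadMetrics line whose ResourceUsage field is blank."""
    parsed = parse_load_metrics(line)
    return parsed is not None and parsed[0] == ""


# The LoadMetrics line from the issue, copied verbatim.
LOG_LINE = (
    "2019-07-13 07:20:26,103 INFO autoscaler.py:658 -- LoadMetrics: "
    "MostDelayedHeartbeats={'10.138.0.7': 0.2548184394836426}, "
    "NodeIdleSeconds=Min=283 Mean=283 Max=283, NumNodesConnected=1, "
    "NumNodesUsed=0.0, ResourceUsage=, "
    "TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0"
)

if __name__ == "__main__":
    print(parse_load_metrics(LOG_LINE))
    print(usage_is_missing(LOG_LINE))
```

Running `usage_is_missing` over each line of the monitor log gives a quick way to check whether the blank field coincides with periods when Tune reports all CPUs as requested.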

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
richardliaw commented, Jul 13, 2019

Can you try running tune.run(ray_auto_init=False)? Maybe we started a separate cluster in Tune.

0 reactions
AdamGleave commented, Nov 7, 2019

We’ve also been hit by this bug on 0.7.2; downgrading to 0.7.0 seems to have resolved the problem. One thing I noticed when troubleshooting is that ResourceUsage only disappears when there are no available resources. This makes me suspect that the total resources are being omitted from the heartbeats for some reason when there are no available resources.

We’ll try upgrading to 0.7.6 in a couple of weeks to see if this is still present, but right now are busy with ICLR reviews.
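
The hypothesis in the comment above can be illustrated with a toy model. This is not Ray's actual heartbeat or LoadMetrics code: `build_heartbeat`, `format_usage`, and the `drop_totals_when_exhausted` flag are all hypothetical names invented for this sketch. It only shows how omitting the totals whenever nothing is available would yield an empty usage string exactly when the node is fully busy.

```python
def format_usage(total, available):
    """Format usage as 'used/total NAME' for every resource listed in `total`."""
    parts = []
    for name in sorted(total):
        used = total[name] - available.get(name, 0.0)
        parts.append("{}/{} {}".format(used, total[name], name))
    return ", ".join(parts)


def build_heartbeat(total, available, drop_totals_when_exhausted):
    """Hypothetical heartbeat builder (NOT Ray's implementation).

    When `drop_totals_when_exhausted` is True, this mimics the suspected bug:
    the totals are left out whenever no resource has any capacity available.
    """
    # Keep only resources that still have spare capacity.
    available = {name: qty for name, qty in available.items() if qty > 0}
    if drop_totals_when_exhausted and not available:
        total = {}
    return {"total": dict(total), "available": available}


if __name__ == "__main__":
    # A node with 8 CPUs, all of them claimed by a running trial.
    for buggy in (True, False):
        hb = build_heartbeat({"CPU": 8.0}, {"CPU": 0.0}, buggy)
        print(buggy, repr(format_usage(hb["total"], hb["available"])))
```

In this model the summary is only empty in the suspected-bug mode with zero available resources, which matches the observation that ResourceUsage disappears only when the node is saturated. Whether Ray 0.7.2 behaves this way internally is an open question in the issue.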
