[autoscaler/core] ResourceUsage is empty
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): source
- Ray version: https://github.com/ray-project/ray/commit/322b5166ada11ee0224c7954419f761cb4a2a252
- Python version: 3.7
- Exact command to reproduce: Ran by custom code, but most likely reproducible with any autoscaler example requiring at least 1 worker.
Describe the problem
Autoscaler doesn’t recognize the resource usage on the nodes when running a Tune experiment. The head node runs trials fine, reporting the following status:
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0.0/0 GPUs
Memory usage on this node: 1.8/31.6 GB
Result logdir: /home/ubuntu/ray_results/gym/Ant/v3/2019-07-13T07-16-53-post-corl-sweep-1
Number of trials: 80 ({'RUNNING': 1, 'PENDING': 79})
PENDING trials:
- id=30ae138b-seed=4520: PENDING
- id=30ae138c-seed=9608: PENDING
- id=30ae138d-seed=324: PENDING
- id=30ae138e-seed=7819: PENDING
- id=30ae138f-seed=761: PENDING
- id=30ae1390-seed=5996: PENDING
- id=30ae1391-seed=1984: PENDING
- id=30ae1392-seed=2190: PENDING
- id=30ae1393-seed=254: PENDING
... 61 not shown
- id=30ae13d1-seed=5943: PENDING
- id=30ae13d2-seed=7901: PENDING
- id=30ae13d3-seed=983: PENDING
- id=30ae13d4-seed=5918: PENDING
- id=30ae13d5-seed=7770: PENDING
- id=30ae13d6-seed=4246: PENDING
- id=30ae13d7-seed=4773: PENDING
- id=30ae13d8-seed=2582: PENDING
- id=30ae13d9-seed=7942: PENDING
RUNNING trials:
- id=30ae138a-seed=9072: RUNNING, [8 CPUs, 0.0 GPUs], [pid=2848], 168 s, 5 iter, 5000 ts
The autoscaler logs, however, show no sign of resource usage:
2019-07-13 07:20:21,068 INFO autoscaler.py:479 -- Ending bringup phase
2019-07-13 07:20:26,103 INFO autoscaler.py:657 -- StandardAutoscaler: 0/0 target nodes (0 pending)
2019-07-13 07:20:26,103 INFO autoscaler.py:658 -- LoadMetrics: MostDelayedHeartbeats={'10.138.0.7': 0.2548184394836426}, NodeIdleSeconds=Min=283 Mean=283 Max=283, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
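To spot this symptom without reading the logs by eye, the LoadMetrics line can be checked mechanically. The following is a minimal sketch, assuming the field order shown in the log line above (ResourceUsage followed by TimeSinceLastHeartbeat). The helper names resource_usage_field and is_usage_missing are hypothetical and not part of Ray, and the "healthy" line is made up for illustration only.

```python
import re

# Log line copied from the autoscaler output above.
LINE = (
    "2019-07-13 07:20:26,103 INFO autoscaler.py:658 -- LoadMetrics: "
    "MostDelayedHeartbeats={'10.138.0.7': 0.2548184394836426}, "
    "NodeIdleSeconds=Min=283 Mean=283 Max=283, NumNodesConnected=1, "
    "NumNodesUsed=0.0, ResourceUsage=, "
    "TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0"
)

# Made-up line for comparison; the real non-empty format may differ.
HYPOTHETICAL_HEALTHY_LINE = LINE.replace(
    "ResourceUsage=,", "ResourceUsage=8.0/8.0 CPU,"
)


def resource_usage_field(line):
    """Return the text between 'ResourceUsage=' and the next field, or None.

    Assumes (hypothetically) that TimeSinceLastHeartbeat always follows.
    """
    match = re.search(r"ResourceUsage=(.*?), TimeSinceLastHeartbeat=", line)
    return match.group(1) if match else None


def is_usage_missing(line):
    """True when the line has a ResourceUsage field and it is blank."""
    field = resource_usage_field(line)
    return field is not None and field.strip() == ""


if __name__ == "__main__":
    print(is_usage_missing(LINE))
    print(is_usage_missing(HYPOTHETICAL_HEALTHY_LINE))
```

Running a check like this over the autoscaler log would show whether the field is always blank or only blank while the node is fully occupied.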
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
Can you try running
tune.run(ray_auto_init=False)
? Maybe we started a separate cluster in Tune.

We’ve also been hit by this bug on 0.7.2; downgrading to 0.7.0 seems to have resolved the problem. One thing I noticed when troubleshooting is that ResourceUsage disappears only when there are no available resources. This makes me suspect that the total resources are, for some reason, being omitted from the heartbeats when there are no available resources.
We’ll try upgrading to 0.7.6 in a couple of weeks to see if this is still present, but right now we are busy with ICLR reviews.
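The suspicion above can be illustrated with a toy model. This is a sketch of the hypothesis only, not Ray's actual LoadMetrics code: summarize_usage and the two dictionaries are hypothetical names, and the assumption is that the summary string is built by iterating over the per-node totals carried in heartbeats. Under that assumption, a heartbeat that drops the totals yields an empty summary even though the node is fully busy.

```python
def summarize_usage(static_by_ip, dynamic_by_ip):
    """Toy summary: 'used/total NAME' for each resource with a known total.

    static_by_ip:  {ip: {resource_name: total}}      (hypothetical layout)
    dynamic_by_ip: {ip: {resource_name: available}}  (hypothetical layout)
    """
    totals = {}
    available = {}
    for ip, static in static_by_ip.items():
        for name, amount in static.items():
            totals[name] = totals.get(name, 0.0) + amount
            free = dynamic_by_ip.get(ip, {}).get(name, 0.0)
            available[name] = available.get(name, 0.0) + free
    return ", ".join(
        f"{totals[name] - available[name]}/{totals[name]} {name}"
        for name in sorted(totals)
    )


# Heartbeat that carries totals: the node is fully used and reported as such.
healthy = summarize_usage(
    {"10.138.0.7": {"CPU": 8.0}},
    {"10.138.0.7": {"CPU": 0.0}},
)

# Hypothesized faulty heartbeat: totals are omitted once nothing is available,
# so there is nothing to iterate over and the summary collapses to "".
faulty = summarize_usage(
    {"10.138.0.7": {}},
    {"10.138.0.7": {}},
)

if __name__ == "__main__":
    print(repr(healthy))
    print(repr(faulty))
```

If the real cause matches this model, the empty field would appear only while every resource on the node is claimed, which is consistent with the observation reported in the comment.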