[Bug] Ray Autoscaler is not spinning down idle nodes due to secondary object copies
See original GitHub issue
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Tune
What happened + What you expected to happen
Ray Autoscaler is not spinning down idle nodes if they ever ran a trial for the currently active Ray Tune job.
The issue is seen with Ray Tune on a Ray 1.9.1 cluster with a CPU head node and GPU workers (min=0, max=9).
The Higgs Ray Tune job is set up to run up to 10 trials using async hyperband for 1 hour with
a max_concurrency of 3. I see at most 3 trials running at a time (each requiring 1 GPU and 4 CPUs).
Except in the first log output at job startup, no PENDING trials are reported.
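For context, the setup described above corresponds roughly to the following Tune invocation. This is only a minimal sketch: the real job is driven by the Ludwig AutoML harness in the reproduction script below, so the trainable, searcher, metric name, and search space here are placeholders, written against Ray 1.9 import paths.

```python
# Minimal sketch approximating the described setup (Ray 1.9 API paths).
# The trainable, metric, and search space are placeholders, not the real job.
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch  # assumes hyperopt is installed


def train_fn(config):
    # Stand-in for the Higgs/TabNet training loop.
    tune.report(metric_score=0.5)


tune.run(
    train_fn,
    name="higgs",
    metric="metric_score",
    mode="min",
    num_samples=10,                       # up to 10 trials
    time_budget_s=3600,                   # 1 hour budget
    scheduler=AsyncHyperBandScheduler(),  # async hyperband
    search_alg=ConcurrencyLimiter(HyperOptSearch(), max_concurrent=3),
    config={"training.learning_rate": tune.loguniform(0.001, 0.1)},
    resources_per_trial={"cpu": 4, "gpu": 1},
)
```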
When the Ray Tune job is stopped at its 1-hour time limit at 12:43:48, the console log (see below) shows:
*) 3 nodes running Higgs trials (10.0.4.71, 10.0.6.5, 10.0.4.38)
*) 2 nodes that previously ran Higgs trials but are not doing so now (10.0.2.129, 10.0.3.245).
The latter 2 nodes last reported running trials at 12:21:30, so they should be spun down.
Note that, in this run, multiple Ray Tune jobs were running in the same Ray cluster with some overlap:
MushroomEdibility Ray Tune 1hr job ran from 11:20-12:20
ForestCover Ray Tune 1hr job ran from 11:22-12:22
Higgs Ray Tune 1hr job ran from 11:45-12:45
After 12:22 there was no overlap between jobs, so the 2 idle workers that remained up had no use other than the Higgs trials they had run earlier.
Two other nodes that became idle when MushroomEdibility and ForestCover completed were spun down at that point, but the 2 idle nodes that Higgs had used were left running.
In the same kind of scenario later in the run, I observed that after the Higgs job completed, all of the Higgs trial workers were spun down.
Current time: 2022-01-24 12:43:48 (running for 00:59:30.27) ...
Number of trials: 9/10 (3 RUNNING, 6 TERMINATED)
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | training.batch_size | training.decay_rate | training.decay_steps | training.learning_rate | iter | total time (s) | metric_score |
|----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------|
| trial_04bb0f22 | RUNNING | 10.0.4.71:1938 | 0.7 | 2048 | 4 | 8 | 1 | 32 | 0.0001 | 8192 | 0.95 | 500 | 0.025 | 18 | 3566.91 | 0.489641 |
| trial_3787a9c4 | RUNNING | 10.0.6.5:17263 | 0.9 | 4096 | 7 | 24 | 1.5 | 64 | 0 | 256 | 0.9 | 10000 | 0.01 | | | |
| trial_39a0ad6e | RUNNING | 10.0.4.38:8657 | 0.8 | 256 | 3 | 16 | 1.2 | 64 | 0.001 | 4096 | 0.95 | 2000 | 0.005 | 4 | 1268.2 | 0.50659 |
| trial_05396980 | TERMINATED | 10.0.2.129:2985 | 0.8 | 256 | 9 | 128 | 1 | 32 | 0 | 2048 | 0.95 | 10000 | 0.005 | 1 | 913.295 | 0.53046 |
| trial_059befa6 | TERMINATED | 10.0.3.245:282 | 0.98 | 1024 | 3 | 8 | 1 | 8 | 1e-06 | 1024 | 0.8 | 500 | 0.005 | 1 | 316.455 | 0.573849 |
| trial_c433a60c | TERMINATED | 10.0.3.245:281 | 0.8 | 1024 | 7 | 24 | 2 | 8 | 0.001 | 256 | 0.95 | 20000 | 0.01 | 1 | 1450.99 | 0.568653 |
| trial_277d1a8a | TERMINATED | 10.0.4.38:8658 | 0.9 | 256 | 5 | 64 | 1.5 | 64 | 0.0001 | 512 | 0.95 | 20000 | 0.005 | 1 | 861.914 | 0.56506 |
| trial_26f6b0b0 | TERMINATED | 10.0.2.129:3079 | 0.6 | 256 | 3 | 16 | 1.2 | 16 | 0.01 | 1024 | 0.9 | 8000 | 0.005 | 1 | 457.482 | 0.56582 |
| trial_2acddc5e | TERMINATED | 10.0.3.245:504 | 0.6 | 512 | 5 | 32 | 2 | 8 | 0 | 2048 | 0.95 | 10000 | 0.025 | 1 | 447.483 | 0.594953 |
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
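For reference, a diagnostic sketch (not part of the reproduction script) that lists alive nodes and their resources, which can be used to confirm that the two idle GPU workers noted above are still registered with the cluster; the node IPs in the comment are the ones from the log:

```python
# Diagnostic sketch: list alive nodes so the idle GPU workers
# (10.0.2.129, 10.0.3.245) can be seen still attached to the cluster.
import ray

ray.init(address="auto")  # connect to the running cluster

for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))
```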
Versions / Dependencies
Ray 1.9.1
Reproduction script
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh, run with Ray deployed on a K8s cluster. I can provide the Ray deployment script if desired.
Anything else
This problem is highly reproducible for me.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Comments: 67 (35 by maintainers)
Top GitHub Comments
Hmm do we know what created those objects and what is referencing them? “ray memory” can show you more information on this.
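For readers hitting the same thing, the following is a generic illustration of the mechanism this comment points at, not code from this issue's reproduction: object refs held by the driver keep copies pinned in the object store, `ray memory` shows which processes hold them, and dropping the refs lets the copies be reclaimed.

```python
# Generic illustration (not from the repro): ObjectRefs held by the driver
# pin object copies in the cluster's object store.
import numpy as np
import ray

ray.init(address="auto")


@ray.remote
def produce():
    # The returned array lives in the object store of whichever worker ran this.
    return np.zeros((1000, 1000))


refs = [produce.remote() for _ in range(3)]
ray.get(refs)   # fetching also materializes secondary copies on the driver's node
# At this point, running `ray memory` from a shell lists these objects and the
# processes that reference them.
del refs        # releasing the references allows Ray to reclaim the copies
```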
Yes, thank you @mwtian and @iycheng, that worked! At least it worked fine in my single-worker-node repro scenario above, so hopefully it will work in general.
I was able to do the “pip install” for the workers using the “setupCommands:” and I was able to do the “pip install” for the head here: