
[Bug] Ray Autoscaler is not spinning down idle nodes due to secondary object copies

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

Ray Autoscaler is not spinning down idle nodes if they ever ran a trial for the active Ray Tune job

The issue is seen with Ray Tune on a Ray 1.9.1 cluster with a CPU-only head node and GPU worker nodes (min=0, max=9).
The Higgs Ray Tune job is set up to run up to 10 trials using async hyperband for 1 hr with
a max_concurrency of 3. I see at most 3 trials running (each requiring 1 GPU and 4 CPUs).
Except in the first log output at job startup, no PENDING trials are reported.
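
For reference, the shape of the Tune job being described is roughly the following. This is a minimal sketch, not the actual Ludwig AutoML driver: the trainable, metric name, search space, and choice of searcher are placeholders; only the trial count, time budget, concurrency cap, and per-trial resources come from the description above.

```python
# Minimal sketch of the reported setup (Ray 1.9.x APIs); NOT the real driver script.
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch  # placeholder searcher; needs `pip install hyperopt`


def train_higgs(config):
    # Placeholder objective; the real job trains a model on the Higgs dataset.
    for step in range(20):
        tune.report(metric_score=1.0 / (step + 1))


ray.init(address="auto")  # connect to the existing CPU-head / GPU-worker cluster

tune.run(
    train_higgs,
    metric="metric_score",
    mode="min",
    num_samples=10,                              # up to 10 trials
    time_budget_s=3600,                          # stop the study after 1 hour
    scheduler=AsyncHyperBandScheduler(),         # async hyperband
    search_alg=ConcurrencyLimiter(HyperOptSearch(), max_concurrent=3),
    resources_per_trial={"cpu": 4, "gpu": 1},    # at most one trial per GPU worker
    config={                                     # placeholder search space
        "learning_rate": tune.loguniform(1e-3, 1e-1),
        "batch_size": tune.choice([256, 1024, 4096, 8192]),
    },
)
```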

At the time the Ray Tune job is stopped for the 1hr time limit at 12:43:48, the console log (see below) shows:
*) 3 nodes running Higgs trials (10.0.4.71, 10.0.6.5, 10.0.4.38)
*) 2 nodes that previously ran Higgs trials but are not doing so now (10.0.2.129, 10.0.3.245).
The latter 2 nodes last reported running trials at 12:21:30, so they should have been spun down.

Note that, in this run, multiple Ray Tune jobs were running in the same Ray cluster with some overlap:
 MushroomEdibility Ray Tune 1hr job ran from 11:20-12:20
 ForestCover       Ray Tune 1hr job ran from 11:22-12:22
 Higgs             Ray Tune 1hr job ran from 11:45-12:45
After 12:22 there was no overlap between jobs, so the only history on the 2 idle workers that remained up was their earlier Higgs trials.
Two other nodes that became idle when MushroomEdibility and ForestCover completed were spun down at that point, while the 2 idle nodes that Higgs had used were left running.
In the same scenario later in the run, I observed that after the Higgs job completed, all of the Higgs trial workers were spun down.

Current time: 2022-01-24 12:43:48 (running for 00:59:30.27) ...
Number of trials: 9/10 (3 RUNNING, 6 TERMINATED)

+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
| Trial name     | status     | loc             |   combiner.bn_momentum |   combiner.bn_virtual_bs |   combiner.num_steps |   combiner.output_size |   combiner.relaxation_factor |   combiner.size |   combiner.sparsity |   training.batch_size |   training.decay_rate |   training.decay_steps |   training.learning_rate |   iter |   total time (s) |   metric_score |
|----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------|
| trial_04bb0f22 | RUNNING    | 10.0.4.71:1938  |                   0.7  |                     2048 |                    4 |                      8 |                          1   |              32 |              0.0001 |                  8192 |                  0.95 |                    500 |                    0.025 |     18 |         3566.91  |       0.489641 |
| trial_3787a9c4 | RUNNING    | 10.0.6.5:17263  |                   0.9  |                     4096 |                    7 |                     24 |                          1.5 |              64 |              0      |                   256 |                  0.9  |                  10000 |                    0.01  |        |                  |                |
| trial_39a0ad6e | RUNNING    | 10.0.4.38:8657  |                   0.8  |                      256 |                    3 |                     16 |                          1.2 |              64 |              0.001  |                  4096 |                  0.95 |                   2000 |                    0.005 |      4 |         1268.2   |       0.50659  |
| trial_05396980 | TERMINATED | 10.0.2.129:2985 |                   0.8  |                      256 |                    9 |                    128 |                          1   |              32 |              0      |                  2048 |                  0.95 |                  10000 |                    0.005 |      1 |          913.295 |       0.53046  |
| trial_059befa6 | TERMINATED | 10.0.3.245:282  |                   0.98 |                     1024 |                    3 |                      8 |                          1   |               8 |              1e-06  |                  1024 |                  0.8  |                    500 |                    0.005 |      1 |          316.455 |       0.573849 |
| trial_c433a60c | TERMINATED | 10.0.3.245:281  |                   0.8  |                     1024 |                    7 |                     24 |                          2   |               8 |              0.001  |                   256 |                  0.95 |                  20000 |                    0.01  |      1 |         1450.99  |       0.568653 |
| trial_277d1a8a | TERMINATED | 10.0.4.38:8658  |                   0.9  |                      256 |                    5 |                     64 |                          1.5 |              64 |              0.0001 |                   512 |                  0.95 |                  20000 |                    0.005 |      1 |          861.914 |       0.56506  |
| trial_26f6b0b0 | TERMINATED | 10.0.2.129:3079 |                   0.6  |                      256 |                    3 |                     16 |                          1.2 |              16 |              0.01   |                  1024 |                  0.9  |                   8000 |                    0.005 |      1 |          457.482 |       0.56582  |
| trial_2acddc5e | TERMINATED | 10.0.3.245:504  |                   0.6  |                      512 |                    5 |                     32 |                          2   |               8 |              0      |                  2048 |                  0.95 |                  10000 |                    0.025 |      1 |          447.483 |       0.594953 |
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+

Versions / Dependencies

Ray 1.9.1

Reproduction script

https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh, run with Ray deployed on a K8s cluster. I can provide the Ray deployment script if desired.

Anything else

This problem is highly reproducible for me.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 67 (35 by maintainers)

Top GitHub Comments

3 reactions
ericl commented, Jan 31, 2022

Hmm do we know what created those objects and what is referencing them? “ray memory” can show you more information on this.
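
For background (this sketch is illustrative and not from the thread): a secondary copy appears whenever an object created on one node is fetched by a process on another node, and, per the issue title, copies like these are what kept otherwise idle workers from being reclaimed. The `ray memory` CLI mentioned above lists each reference and which nodes hold copies.

```python
# Illustrative sketch only (not the Tune internals): how an object created on a
# worker node ends up with a copy on the caller's node once it is fetched.
import numpy as np
import ray

ray.init(address="auto")  # assumes an existing multi-node cluster; plain ray.init() works for a local test


@ray.remote
def make_result():
    # Large enough (> 100 KiB) to live in the shared-memory object store.
    return np.zeros(10_000_000, dtype=np.float32)  # ~40 MB


ref = make_result.remote()  # primary copy lives on whichever node ran the task
result = ray.get(ref)       # fetching it leaves a secondary copy on this node

# While `ref` is still referenced, the primary copy stays pinned; running
# `ray memory` from a shell on the head node shows the reference holders and
# the nodes storing copies, which is what is being suggested above.
```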

2 reactions
amholler commented, Feb 3, 2022

Yes, thank you @mwtian and @iycheng, that worked! At least it worked fine in my single-worker-node repro scenario above, so hopefully it will work in general.

I was able to do the “pip install” for the workers using the “setupCommands:”, and I was able to do the “pip install” for the head here:

```yaml
  headStartRayCommands:
    - ray stop
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
```