[Tune] HyperBandScheduler throws TuneError for some max_t - reduction_factor pairs
What is the problem?
- OS: Ubuntu
- Ray version: 0.8.0 (installed through `pip`)
For certain combinations of the `max_t` and `reduction_factor` arguments of `ray.tune.schedulers.HyperBandScheduler`, I get the following error once no more trials are in PENDING mode:

`ray.tune.error.TuneError: There are paused trials, but no more pending trials with sufficient resources.`

For some reason Tune does not continue the trials that are in PAUSED mode (which is what I would expect it to do). I have not been able to find a clear pattern for which combinations of `max_t` and `reduction_factor` cause this. I have included one example below: a simple adaptation of the `HyperBandScheduler` example, where the only thing I changed is the value of `reduction_factor`.
Reproduction
I’ve put the following in a script `hb_test.py`:
```python
#!/usr/bin/env python

import argparse
import json
import os
import random

import numpy as np

import ray
from ray.tune import Trainable, run, Experiment, sample_from
from ray.tune.schedulers import HyperBandScheduler


class MyTrainableClass(Trainable):
    """Example agent whose learning curve is a random sigmoid.

    The dummy hyperparameters "width" and "height" determine the slope and
    maximum reward value reached.
    """

    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config.get("width", 1))
        v *= self.config.get("height", 1)

        # Here we use `episode_reward_mean`, but you can also report other
        # objectives such as loss or accuracy.
        return {"episode_reward_mean": v}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.init()

    # Hyperband early stopping, configured with `episode_reward_mean` as the
    # objective and `training_iteration` as the time unit,
    # which is automatically filled by Tune.
    hyperband = HyperBandScheduler(
        time_attr="training_iteration",
        metric="episode_reward_mean",
        mode="max",
        max_t=100,
        #####################################
        # Change is located here
        #####################################
        reduction_factor=2,  # this crashes for me, whereas reduction_factor=3 works
    )

    exp = Experiment(
        name="hyperband_test",
        run=MyTrainableClass,
        num_samples=20,
        stop={"training_iteration": 1 if args.smoke_test else 99999},
        config={
            "width": sample_from(lambda spec: 10 + int(90 * random.random())),
            "height": sample_from(lambda spec: int(100 * random.random()))
        })

    run(exp, scheduler=hyperband, verbose=1)
```
and created a new conda environment:

```bash
conda create --name ray-bug python=3.6
conda activate ray-bug
pip install ray tabulate requests psutil
python hb_test.py
```
which generates the error:

```
Traceback (most recent call last):
  File "hb_test.py", line 72, in <module>
    run(exp, scheduler=hyperband, verbose=1)
  File "/home/thomas/anaconda3/envs/ray-bug/lib/python3.6/site-packages/ray/tune/tune.py", line 304, in run
    runner.step()
  File "/home/thomas/anaconda3/envs/ray-bug/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 341, in step
    self.trial_executor.on_no_available_trials(self)
  File "/home/thomas/anaconda3/envs/ray-bug/lib/python3.6/site-packages/ray/tune/trial_executor.py", line 175, in on_no_available_trials
    raise TuneError("There are paused trials, but no more pending "
ray.tune.error.TuneError: There are paused trials, but no more pending trials with sufficient resources.
```
However, setting `reduction_factor=3` (as in the original example) makes the code work.
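For context (this analysis is not from the original report): the number and size of HyperBand brackets depend on `max_t` and `reduction_factor`, so different pairs give the scheduler quite different pause/promote patterns. The sketch below uses the textbook Hyperband bracket arithmetic from Li et al., which Ray's scheduler only approximates, to show how the layout differs between the two settings:

```python
import math


def hyperband_brackets(R, eta):
    """Textbook Hyperband bracket layout (Li et al.).

    R   -- maximum resource per trial (corresponds to max_t)
    eta -- reduction factor

    Returns a list of (num_trials, initial_resource) pairs, one per bracket.
    This is the reference algorithm, not Ray's exact implementation.
    """
    s_max = int(math.floor(math.log(R) / math.log(eta)))
    B = (s_max + 1) * R  # budget assigned to each bracket
    brackets = []
    for s in range(s_max, -1, -1):
        n = int(math.ceil((B / R) * (eta ** s) / (s + 1)))  # trials started in this bracket
        r = R * eta ** (-s)                                 # resource per trial at the first rung
        brackets.append((n, r))
    return brackets


print(len(hyperband_brackets(100, 2)))  # 7 brackets with reduction_factor=2
print(len(hyperband_brackets(100, 3)))  # 5 brackets with reduction_factor=3
```

With `max_t=100`, a reduction factor of 2 yields 7 brackets while 3 yields 5, so trials are paused and promoted on rather different schedules, which is presumably why only some `max_t`/`reduction_factor` combinations trigger the error above.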
Top GitHub Comments
Could you open a new issue to track that? Thanks!
Hi @richardliaw, I upgraded to Ray 0.8.7 and this issue appears to be resolved. I had a follow-up: I noticed that in 0.8.7 the `max_concurrency` parameter for HyperOpt is deprecated (in favor of `ConcurrencyLimiter`), but in BOHB it isn’t. Is there a reason for this?
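Not part of the original thread, but for readers hitting the same deprecation: a minimal sketch of wrapping a searcher in `ConcurrencyLimiter` instead of passing `max_concurrency`, assuming the 0.8.7-era HyperOpt integration and reusing `MyTrainableClass` from the repro script above; the search space here is made up for illustration.

```python
from hyperopt import hp

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

# Hypothetical search space, mirroring the "width"/"height" example above.
space = {
    "width": hp.uniform("width", 10, 100),
    "height": hp.uniform("height", 0, 100),
}

searcher = HyperOptSearch(space, metric="episode_reward_mean", mode="max")
# Replaces the deprecated max_concurrency argument on the searcher itself.
searcher = ConcurrencyLimiter(searcher, max_concurrent=4)

tune.run(MyTrainableClass, search_alg=searcher, num_samples=20)
```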