[Tune] HyperBandScheduler throws TuneError for some max_t - reduction_factor pairs
What is the problem?
- OS: Ubuntu
- Ray version: 0.8.0 (installed through `pip`)
For certain combinations of the `max_t` and `reduction_factor` arguments of `ray.tune.schedulers.HyperBandScheduler`, I get the following error once no more trials are in PENDING mode:

`ray.tune.error.TuneError: There are paused trials, but no more pending trials with sufficient resources.`

For some reason Tune does not continue the trials that are in PAUSED mode (which is what I would expect it to do). I have not been able to find a clear pattern for which combinations of `max_t` and `reduction_factor` cause this. I have included one example below: a simple adaptation of the `HyperBandScheduler` example, where the only thing I changed is the value of `reduction_factor`.
Reproduction
I’ve put the following in a script `hb_test.py`:
```python
#!/usr/bin/env python

import argparse
import json
import os
import random

import numpy as np

import ray
from ray.tune import Trainable, run, Experiment, sample_from
from ray.tune.schedulers import HyperBandScheduler


class MyTrainableClass(Trainable):
    """Example agent whose learning curve is a random sigmoid.

    The dummy hyperparameters "width" and "height" determine the slope and
    maximum reward value reached.
    """

    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config.get("width", 1))
        v *= self.config.get("height", 1)

        # Here we use `episode_reward_mean`, but you can also report other
        # objectives such as loss or accuracy.
        return {"episode_reward_mean": v}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.init()

    # Hyperband early stopping, configured with `episode_reward_mean` as the
    # objective and `training_iteration` as the time unit,
    # which is automatically filled by Tune.
    hyperband = HyperBandScheduler(
        time_attr="training_iteration",
        metric="episode_reward_mean",
        mode="max",
        max_t=100,
        #####################################
        # Change is located here
        #####################################
        reduction_factor=2,  # this crashes for me, whereas reduction_factor=3 works
    )

    exp = Experiment(
        name="hyperband_test",
        run=MyTrainableClass,
        num_samples=20,
        stop={"training_iteration": 1 if args.smoke_test else 99999},
        config={
            "width": sample_from(lambda spec: 10 + int(90 * random.random())),
            "height": sample_from(lambda spec: int(100 * random.random()))
        })

    run(exp, scheduler=hyperband, verbose=1)
```
and created a new conda environment:

```bash
conda create --name ray-bug python=3.6
conda activate ray-bug
pip install ray tabulate requests psutil
python hb_test.py
```
which generates the error:

```
Traceback (most recent call last):
  File "hb_test.py", line 72, in <module>
    run(exp, scheduler=hyperband, verbose=1)
  File "/home/thomas/anaconda3/envs/ray-bug/lib/python3.6/site-packages/ray/tune/tune.py", line 304, in run
    runner.step()
  File "/home/thomas/anaconda3/envs/ray-bug/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 341, in step
    self.trial_executor.on_no_available_trials(self)
  File "/home/thomas/anaconda3/envs/ray-bug/lib/python3.6/site-packages/ray/tune/trial_executor.py", line 175, in on_no_available_trials
    raise TuneError("There are paused trials, but no more pending "
ray.tune.error.TuneError: There are paused trials, but no more pending trials with sufficient resources.
```
However, setting `reduction_factor=3` (as in the original example) makes the code work.
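For context (this analysis is not from the original report): the number and size of HyperBand brackets depend on `max_t` and `reduction_factor`, so different pairs give the scheduler quite different pause/promote patterns. The sketch below uses the textbook Hyperband bracket arithmetic from Li et al., which Ray's scheduler only approximates, to show how the layout differs between the two settings:

```python
import math


def hyperband_brackets(R, eta):
    """Textbook Hyperband bracket layout (Li et al.).

    R   -- maximum resource per trial (corresponds to max_t)
    eta -- reduction factor

    Returns a list of (num_trials, initial_resource) pairs, one per bracket.
    This is the reference algorithm, not Ray's exact implementation.
    """
    s_max = int(math.floor(math.log(R) / math.log(eta)))
    B = (s_max + 1) * R  # budget assigned to each bracket
    brackets = []
    for s in range(s_max, -1, -1):
        n = int(math.ceil((B / R) * (eta ** s) / (s + 1)))  # trials started in this bracket
        r = R * eta ** (-s)                                 # resource per trial at the first rung
        brackets.append((n, r))
    return brackets


print(len(hyperband_brackets(100, 2)))  # 7 brackets with reduction_factor=2
print(len(hyperband_brackets(100, 3)))  # 5 brackets with reduction_factor=3
```

With `max_t=100`, a reduction factor of 2 yields 7 brackets while 3 yields 5, so trials are paused and promoted on rather different schedules, which is presumably why only some `max_t`/`reduction_factor` combinations trigger the error above.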
Top GitHub Comments
Could you open a new issue to track that? Thanks!
Hi @richardliaw, I upgraded to Ray 0.8.7 and this issue appears to be resolved. I had a follow-up: I noticed that in 0.8.7 the `max_concurrency` parameter for HyperOpt is deprecated (in favor of `ConcurrencyLimiter`), but in BOHB it isn’t. Is there a reason for this?
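Not part of the original thread, but for readers hitting the same deprecation: a minimal sketch of wrapping a searcher in `ConcurrencyLimiter` instead of passing `max_concurrency`, assuming the 0.8.7-era HyperOpt integration and reusing `MyTrainableClass` from the repro script above; the search space here is made up for illustration.

```python
from hyperopt import hp

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

# Hypothetical search space, mirroring the "width"/"height" example above.
space = {
    "width": hp.uniform("width", 10, 100),
    "height": hp.uniform("height", 0, 100),
}

searcher = HyperOptSearch(space, metric="episode_reward_mean", mode="max")
# Replaces the deprecated max_concurrency argument on the searcher itself.
searcher = ConcurrencyLimiter(searcher, max_concurrent=4)

tune.run(MyTrainableClass, search_alg=searcher, num_samples=20)
```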