
[tune] stopping criterion to propagate to other trials

See original GitHub issue

We use ray.tune to do automated hyperparameter tuning (yay 🙌). To check that we have wired everything up correctly in our projects, we often add an integration test that runs tune and checks whether an Experiment can reach an easily achievable accuracy within a certain time.

Currently, the stopping criterion that you pass to a ray.tune.Experiment seems to be applied only per trial (that is, if one trial reaches it, that trial stops, but the other, worse trials keep running). Since we are satisfied as soon as any trial reaches the stopping criterion, my question is:

Is there a way to stop the entire Experiment (and get a result from ray.tune.run) once any trial reaches the stopping criterion?

Thanks in advance! And thanks for building ray 🧡
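For context, here is a minimal sketch of the per-trial behavior the question describes, assuming a dict-style stop criterion. The trainable, metric name, and threshold below are placeholders, not from the original issue, and the exact reporting call varies across Ray versions:

```python
from ray import tune

def toy_trainable(config):
    # Hypothetical stand-in for the real training function.
    acc = 0.0
    while True:
        acc += config["lr"]
        tune.report(mean_accuracy=acc)

# The dict criterion is evaluated against each trial's own results,
# so each trial stops only once IT reports mean_accuracy >= 0.95;
# slower trials keep running until they individually get there.
tune.run(
    toy_trainable,
    config={"lr": tune.grid_search([0.05, 0.2, 0.5])},
    stop={"mean_accuracy": 0.95},
)
```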

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
richardliaw commented, Sep 18, 2019

Ah, I see; I’ll get to the programmatic stopping condition PR this weekend. That should allow you to do what you want.
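The programmatic stopping condition mentioned here appears to have become the tune.Stopper interface. A minimal sketch of how that interface addresses the original question, assuming current Ray Tune (the trainable, metric name, and threshold are placeholders): returning True from stop_all() terminates every remaining trial, so flipping a shared flag once any trial hits the target ends the whole run.

```python
from ray import tune

class AnyTrialReached(tune.Stopper):
    # Ends the WHOLE experiment once any trial reaches the target.
    # Metric name and threshold are placeholders for this example.
    def __init__(self, metric="mean_accuracy", threshold=0.95):
        self._metric = metric
        self._threshold = threshold
        self._hit = False

    def __call__(self, trial_id, result):
        # Invoked on the driver for every reported result.
        if result[self._metric] >= self._threshold:
            self._hit = True
        return self._hit  # True stops this trial

    def stop_all(self):
        # Consulted by the trial runner; True terminates all trials.
        return self._hit

def toy_trainable(config):
    acc = 0.0
    while True:
        acc += config["lr"]
        tune.report(mean_accuracy=acc)

tune.run(
    toy_trainable,
    config={"lr": tune.grid_search([0.05, 0.2, 0.5])},
    stop=AnyTrialReached(),
)
```

Because the single stopper instance lives on the driver and stop_all() is polled there, the shared flag is not copied into per-trial state, which is the pitfall the next comment runs into.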

0 reactions
alexabstreiter commented, Oct 30, 2019

I am running a basic PBT on MNIST. I logged self.should_stop at every call of stop() in the Stopper class, and again right after the stopping criterion was fulfilled.

The first trial, 85bb094a, fulfills the criterion after the first iteration; self.should_stop is set to True, and its memory address changes. After that, the other two trials, 85bb8e2e and 85ba6ae4, still read self.should_stop at its old memory address holding False, but I would expect them to read the new True value that trial 85bb094a set.

These are the logs:

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.2/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'RUNNING': 1, 'PENDING': 2})
PENDING trials:
 - resnet_1:	PENDING
 - resnet_2:	PENDING
RUNNING trials:
 - resnet_0:	RUNNING

(pid=15838) WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
(pid=15837) WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
(pid=15836) WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.

INFO: StopCalled: time: 2019-10-30 12:22:42.641582 pid: 15819 trial: 85bb094a, should_stop: False - read at memory address: 0x105dd9c10
INFO: Criterion fulfilled: time: 2019-10-30 12:22:42.641800 pid: 15819 trial: 85bb094a, should_stop: True - read at memory address: 0x105dd9bf0

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 2/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.3/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'RUNNING': 2, 'TERMINATED': 1})
RUNNING trials:
 - resnet_0:	RUNNING
 - resnet_2:	RUNNING
TERMINATED trials:
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts

INFO: StopCalled: time: 2019-10-30 12:22:42.970515 pid: 15819 trial: 85bb8e2e, should_stop: False - read at memory address: 0x105dd9c10

INFO: StopCalled: time: 2019-10-30 12:22:43.243287 pid: 15819 trial: 85ba6ae4, should_stop: False - read at memory address: 0x105dd9c10

INFO: StopCalled: time: 2019-10-30 12:22:57.266143 pid: 15819 trial: 85bb8e2e, should_stop: False - read at memory address: 0x105dd9c10
INFO: Criterion fulfilled: time: 2019-10-30 12:22:57.266312 pid: 15819 trial: 85bb8e2e, should_stop: True - read at memory address: 0x105dd9bf0

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.5/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'RUNNING': 1, 'TERMINATED': 2})
RUNNING trials:
 - resnet_0:	RUNNING, [1 CPUs, 0 GPUs], [pid=15838], 18 s, 1 iter, 300 ts
TERMINATED trials:
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts
 - resnet_2:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15836], 32 s, 2 iter, 600 ts

INFO:image_classification_training_test:StopCalled: time: 2019-10-30 12:22:57.636775 pid: 15819 trial: 85ba6ae4, should_stop: False - read at memory address: 0x105dd9c10

INFO: StopCalled: time: 2019-10-30 12:23:05.347539 pid: 15819 trial: 85ba6ae4, should_stop: False - read at memory address: 0x105dd9c10
INFO: Criterion fulfilled: time: 2019-10-30 12:23:05.347709 pid: 15819 trial: 85ba6ae4, should_stop: True - read at memory address: 0x105dd9bf0

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.4/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'TERMINATED': 3})
TERMINATED trials:
 - resnet_0:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15838], 40 s, 3 iter, 900 ts
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts
 - resnet_2:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15836], 32 s, 2 iter, 600 ts

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.4/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'TERMINATED': 3})
TERMINATED trials:
 - resnet_0:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15838], 40 s, 3 iter, 900 ts
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts
 - resnet_2:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15836], 32 s, 2 iter, 600 ts
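A hedged reading of these logs: the addresses being printed are the id()s of the interned False and True singletons, not of the stopper object itself, so the address changing exactly when the flag flips to True is expected Python behavior. The more telling symptom is that the other trials' stoppers still hold False after trial 85bb094a set True, which is what you would observe if the stop criterion were deep-copied once per trial rather than shared. A standalone sketch of that pitfall (plain Python, no Ray):

```python
import copy

class Stopper:
    def __init__(self):
        self.should_stop = False

# If the framework deep-copies the stopper once per trial (an assumption
# about the behavior seen above), each trial gets an independent flag and
# setting it in one copy is invisible to the others.
shared = Stopper()
per_trial = [copy.deepcopy(shared) for _ in range(3)]

per_trial[0].should_stop = True              # "trial 85bb094a" hits the criterion
print([s.should_stop for s in per_trial])    # [True, False, False]

# The logged addresses are just the two bool singletons:
print(hex(id(False)), hex(id(True)))         # two fixed, different addresses
```

If that is what is happening here, sharing state across the copies requires an external channel (a file, a Ray actor, or the driver-side stop_all() shown earlier) rather than a plain instance attribute.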