
[tune] stopping criterion to propagate to other trials

See original GitHub issue

We use ray.tune to do automated hyperparameter tuning (yay 🙌). To check that we have wired everything up correctly in our projects, we often add an integration test that runs tune and checks whether an Experiment can reach an easily achievable accuracy within a certain time.

Currently, the stopping criterion that you pass to a ray.tune.Experiment seems to be applied only per trial (that is, if one trial reaches it, that trial stops, but the other, worse trials keep running). Since we are satisfied as soon as any trial reaches the stopping criterion, my question is:

Is there a way to stop the entire Experiment (and get a result from ray.tune.run) once any trial reaches the stopping criterion?

Thanks in advance! And thanks for building ray 🧡
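For context, here is a minimal sketch of the per-trial behavior the question describes, assuming a dict-style stop criterion. The trainable, metric name, and threshold below are placeholders, not from the original issue, and the exact reporting call varies across Ray versions:

```python
from ray import tune

def toy_trainable(config):
    # Hypothetical stand-in for the real training function.
    acc = 0.0
    while True:
        acc += config["lr"]
        tune.report(mean_accuracy=acc)

# The dict criterion is evaluated against each trial's own results,
# so each trial stops only once IT reports mean_accuracy >= 0.95;
# slower trials keep running until they individually get there.
tune.run(
    toy_trainable,
    config={"lr": tune.grid_search([0.05, 0.2, 0.5])},
    stop={"mean_accuracy": 0.95},
)
```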

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
richardliaw commented, Sep 18, 2019

Ah, I see; I’ll get to the programmatic stopping condition PR this weekend. That should allow you to do what you want.
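The programmatic stopping condition mentioned here appears to have become the tune.Stopper interface. A minimal sketch of how that interface addresses the original question, assuming current Ray Tune (the trainable, metric name, and threshold are placeholders): returning True from stop_all() terminates every remaining trial, so flipping a shared flag once any trial hits the target ends the whole run.

```python
from ray import tune

class AnyTrialReached(tune.Stopper):
    # Ends the WHOLE experiment once any trial reaches the target.
    # Metric name and threshold are placeholders for this example.
    def __init__(self, metric="mean_accuracy", threshold=0.95):
        self._metric = metric
        self._threshold = threshold
        self._hit = False

    def __call__(self, trial_id, result):
        # Invoked on the driver for every reported result.
        if result[self._metric] >= self._threshold:
            self._hit = True
        return self._hit  # True stops this trial

    def stop_all(self):
        # Consulted by the trial runner; True terminates all trials.
        return self._hit

def toy_trainable(config):
    acc = 0.0
    while True:
        acc += config["lr"]
        tune.report(mean_accuracy=acc)

tune.run(
    toy_trainable,
    config={"lr": tune.grid_search([0.05, 0.2, 0.5])},
    stop=AnyTrialReached(),
)
```

Because the single stopper instance lives on the driver and stop_all() is polled there, the shared flag is not copied into per-trial state, which is the pitfall the next comment runs into.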

0 reactions
alexabstreiter commented, Oct 30, 2019

I am running a basic PBT on MNIST. I logged self.should_stop at every call of stop() in the Stopper class, and again right after the stopping criterion was fulfilled.

The first trial, 85bb094a, fulfills the criterion after the first iteration; self.should_stop is set to True, and its memory address changes. After that, the other two trials, 85bb8e2e and 85ba6ae4, still read self.should_stop at its old memory address holding False, but I would expect them to read the new True value that trial 85bb094a set.

These are the logs:

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.2/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'RUNNING': 1, 'PENDING': 2})
PENDING trials:
 - resnet_1:	PENDING
 - resnet_2:	PENDING
RUNNING trials:
 - resnet_0:	RUNNING

(pid=15838) WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
(pid=15837) WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
(pid=15836) WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.

INFO: StopCalled: time: 2019-10-30 12:22:42.641582 pid: 15819 trial: 85bb094a, should_stop: False - read at memory address: 0x105dd9c10
INFO: Criterion fulfilled: time: 2019-10-30 12:22:42.641800 pid: 15819 trial: 85bb094a, should_stop: True - read at memory address: 0x105dd9bf0

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 2/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.3/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'RUNNING': 2, 'TERMINATED': 1})
RUNNING trials:
 - resnet_0:	RUNNING
 - resnet_2:	RUNNING
TERMINATED trials:
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts

INFO: StopCalled: time: 2019-10-30 12:22:42.970515 pid: 15819 trial: 85bb8e2e, should_stop: False - read at memory address: 0x105dd9c10

INFO: StopCalled: time: 2019-10-30 12:22:43.243287 pid: 15819 trial: 85ba6ae4, should_stop: False - read at memory address: 0x105dd9c10

INFO: StopCalled: time: 2019-10-30 12:22:57.266143 pid: 15819 trial: 85bb8e2e, should_stop: False - read at memory address: 0x105dd9c10
INFO: Criterion fulfilled: time: 2019-10-30 12:22:57.266312 pid: 15819 trial: 85bb8e2e, should_stop: True - read at memory address: 0x105dd9bf0

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.5/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'RUNNING': 1, 'TERMINATED': 2})
RUNNING trials:
 - resnet_0:	RUNNING, [1 CPUs, 0 GPUs], [pid=15838], 18 s, 1 iter, 300 ts
TERMINATED trials:
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts
 - resnet_2:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15836], 32 s, 2 iter, 600 ts

INFO:image_classification_training_test:StopCalled: time: 2019-10-30 12:22:57.636775 pid: 15819 trial: 85ba6ae4, should_stop: False - read at memory address: 0x105dd9c10

INFO: StopCalled: time: 2019-10-30 12:23:05.347539 pid: 15819 trial: 85ba6ae4, should_stop: False - read at memory address: 0x105dd9c10
INFO: Criterion fulfilled: time: 2019-10-30 12:23:05.347709 pid: 15819 trial: 85ba6ae4, should_stop: True - read at memory address: 0x105dd9bf0

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.4/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'TERMINATED': 3})
TERMINATED trials:
 - resnet_0:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15838], 40 s, 3 iter, 900 ts
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts
 - resnet_2:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15836], 32 s, 2 iter, 600 ts

== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/2.0 GiB heap, 0.0/0.68 GiB objects
Memory usage on this node: 5.4/8.0 GiB
Result logdir: /Users/Alex/log/MNIST-20191030_122202
Number of trials: 3 ({'TERMINATED': 3})
TERMINATED trials:
 - resnet_0:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15838], 40 s, 3 iter, 900 ts
 - resnet_1:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15837], 18 s, 1 iter, 300 ts
 - resnet_2:	TERMINATED, [1 CPUs, 0 GPUs], [pid=15836], 32 s, 2 iter, 600 ts
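A hedged reading of these logs: the addresses being printed are the id()s of the interned False and True singletons, not of the stopper object itself, so the address changing exactly when the flag flips to True is expected Python behavior. The more telling symptom is that the other trials' stoppers still hold False after trial 85bb094a set True, which is what you would observe if the stop criterion were deep-copied once per trial rather than shared. A standalone sketch of that pitfall (plain Python, no Ray):

```python
import copy

class Stopper:
    def __init__(self):
        self.should_stop = False

# If the framework deep-copies the stopper once per trial (an assumption
# about the behavior seen above), each trial gets an independent flag and
# setting it in one copy is invisible to the others.
shared = Stopper()
per_trial = [copy.deepcopy(shared) for _ in range(3)]

per_trial[0].should_stop = True              # "trial 85bb094a" hits the criterion
print([s.should_stop for s in per_trial])    # [True, False, False]

# The logged addresses are just the two bool singletons:
print(hex(id(False)), hex(id(True)))         # two fixed, different addresses
```

If that is what is happening here, sharing state across the copies requires an external channel (a file, a Ray actor, or the driver-side stop_all() shown earlier) rather than a plain instance attribute.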