
[tune] pbt - checkpointing trials and stopping


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray version: 0.8.0
  • Python version: 3.6.9

Experiment

I have been experimenting with PBT for training a convnet in PyTorch. Everything works fine, but I am a bit frustrated with the checkpointing. My _save and _restore methods are similar to https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_convnet_example.py. I am running an experiment with 4 samples, each on a single GPU. My stopping condition is an accuracy threshold.
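For context, a minimal sketch of what such _save/_restore methods look like in the Ray 0.8 class-based Trainable API. The class name, the toy model in _setup, and the placeholder _train step are illustrative, not the actual experiment code:

    import os

    import torch
    from ray.tune import Trainable

    class ConvNetTrainable(Trainable):
        def _setup(self, config):
            # Placeholder model; the real experiment builds a convnet here.
            self.model = torch.nn.Linear(10, 2)

        def _train(self):
            # Placeholder training step; the real experiment trains the
            # convnet and reports its validation accuracy here.
            return {"mean_accuracy": 0.5}

        def _save(self, checkpoint_dir):
            # Tune passes a (by default temporary) trial directory; write the
            # weights there and return the path so Tune can track the checkpoint.
            checkpoint_path = os.path.join(checkpoint_dir, "model.pth")
            torch.save(self.model.state_dict(), checkpoint_path)
            return checkpoint_path

        def _restore(self, checkpoint_path):
            # Called e.g. when PBT clones a top-performing trial.
            self.model.load_state_dict(torch.load(checkpoint_path))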

Questions

1 - Is it possible to stop all the trials as soon as one of them reaches the stopping condition? For example, in the training below, the blue trial reached the stopping condition and the orange one should be killed.

[image: training curves, with the blue trial reaching the stopping condition]

2 - The top-performing models are saved in temporary directories during training. How can I recover them if the training crashes or I want to kill it early? I understand that I could checkpoint every epoch with checkpoint_freq, but that seems a bit sub-optimal. Is there a way to save the current best model in the trial local directory?

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

ujvl commented, Dec 20, 2019 (3 reactions)

1 - Is it possible to stop all the trials as soon as one of them reaches the stopping condition? For example, in the training below, the blue trial reached the stopping condition and the orange one should be killed.

See https://ray.readthedocs.io/en/latest/tune-usage.html#custom-stopping-criteria and pass in a stateful custom stopping-criteria function, as in the example there.
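A rough sketch of that pattern, assuming the trials report a mean_accuracy metric (the metric name and threshold are illustrative):

    class ThresholdStopper:
        # Stateful criterion: once any trial crosses the threshold, every
        # subsequent call returns True, so all remaining trials are stopped too.
        def __init__(self, threshold=0.95):
            self.threshold = threshold
            self.should_stop = False

        def __call__(self, trial_id, result):
            if not self.should_stop and result["mean_accuracy"] >= self.threshold:
                self.should_stop = True
            return self.should_stop

    # Passed as the `stop` argument of tune.run, e.g.:
    # tune.run(ConvNetTrainable, num_samples=4, stop=ThresholdStopper(0.95), ...)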

Is there a way to save the current best model in the trial local directory?

Set checkpoint_score_attr to the metric you want to use to score the checkpoints, and set keep_checkpoints_num so that the worst checkpoints are deleted.
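Concretely, something along these lines should work; the values are examples, ConvNetTrainable is the Trainable sketched earlier in this thread, and checkpoint_freq is also set so that periodic checkpoints exist to be scored:

    from ray import tune

    tune.run(
        ConvNetTrainable,
        num_samples=4,
        checkpoint_freq=1,                      # write a checkpoint every iteration
        checkpoint_score_attr="mean_accuracy",  # rank checkpoints by this metric
        keep_checkpoints_num=2,                 # keep only the 2 best, delete the rest
    )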

hhbyyh commented, Dec 31, 2019 (0 reactions)

Thanks for the reply, @ujvl. Your suggestion should work. I'm thinking about two things:

  1. What's the right behavior when a user specifies only checkpoint_score_attr and keep_checkpoints_num? Right now we don't do anything unless checkpoint_freq is set, and I'm not sure that's right.

  2. What's the right behavior when a user specifies all three parameters, checkpoint_score_attr, keep_checkpoints_num, and checkpoint_freq? Right now checkpoint_freq is strictly respected, which is fine for me (see the sketch below).
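To make the two cases concrete, this is roughly the behavior being described, with example values (ConvNetTrainable is the Trainable sketched earlier):

    from ray import tune

    # Case 1: only scoring/retention are set -- currently no periodic checkpoints
    # are written, so keep_checkpoints_num never has anything to prune.
    tune.run(ConvNetTrainable,
             checkpoint_score_attr="mean_accuracy",
             keep_checkpoints_num=2)

    # Case 2: all three are set -- checkpoints are written strictly every
    # checkpoint_freq iterations, then ranked by checkpoint_score_attr and
    # pruned down to the best keep_checkpoints_num.
    tune.run(ConvNetTrainable,
             checkpoint_freq=5,
             checkpoint_score_attr="mean_accuracy",
             keep_checkpoints_num=2)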


Top Results From Across the Web

  • Stopping and Resuming a Tune Run (Ray documentation): Upon resuming an interrupted/errored Tune run, Tune first looks at the experiment-level checkpoint to find the list of trials at the time of...
  • Hyperparameter Search: Population-based training: Cloning a trial involves checkpointing it and creating a new trial that continues training from that checkpoint. The hyperparameters of the new trial...
  • Anyscale Connect: Population Based Training with Ray Tune: Population Based Training (PBT) is a hyperparameter optimization method that trains many models in parallel and uses...
  • Hyperparameter tuning with Ray Tune (PyTorch): These metrics can also be used to stop badly performing trials early in order to avoid wasting resources on those trials. The checkpoint...
  • tune.py - ray-project/ray (Sourcegraph): "Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run." ... Otherwise, trials could overwrite artifacts and checkpoints of other trials.
