[Feature] Resume sweeps
Is your feature request related to a problem? Please describe.
In the documentation on resuming runs, there is a note that says:
Note that resuming a run which was executed as part of a Sweep is not supported.
I often want to resume runs that I executed during a sweep – for example, to train for longer if they haven’t finished converging.
Describe the solution you’d like
I’d love it if resuming runs from sweeps was supported!
Describe alternatives you’ve considered
The alternatives I see are all undesirable: loading the run from a saved checkpoint and training in a new run (loses the connection to all the old plots), or training from scratch with more epochs (a waste of time and compute).
Issue Analytics
- Created 2 years ago
- Reactions: 1
- Comments: 13 (4 by maintainers)
@GeoffNN there are likely 2 ways to fix this:
- call wandb.finish(exit_code=1) after you mark the run as preempting
- call exit(1)
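A minimal sketch of the first option, assuming the job receives SIGTERM when it is preempted. The signal choice, the function names and the injected `api` argument are illustrative assumptions, not something stated in this thread. Passing the imported wandb module as `api` would call wandb.mark_preempting() and wandb.finish(exit_code=1); any stand-in object with the same two methods also works, so the sketch can be tried without a W&B account.

```python
import signal
import sys


def make_preemption_handler(api, exit_fn=sys.exit):
    """Build a signal handler that reports preemption and exits non-zero.

    `api` is any object exposing mark_preempting() and finish(exit_code=...),
    for example the imported wandb module after wandb.init() has been called.
    """

    def handler(signum, frame):
        # Tell the backend this run is about to be preempted.
        api.mark_preempting()
        # Close the run with a non-zero exit code so it is not recorded as finished.
        api.finish(exit_code=1)
        # Make the process itself exit with a non-zero status as well.
        exit_fn(1)

    return handler


def install_preemption_handler(api, signum=signal.SIGTERM):
    """Register the handler for `signum` (SIGTERM is an assumption here)."""
    signal.signal(signum, make_preemption_handler(api))


# Hypothetical wiring inside a training script:
#   import wandb
#   wandb.init()
#   install_preemption_handler(wandb)
```

The handler exits with a non-zero status on purpose: a handler that swallows the signal and lets the script return normally would leave the process with exit status zero.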
hi @GeoffNN, this is not actually the feature I was talking about, we have another one coming out soon called rewind (has been delayed a bit) that should make addressing this issue easier.
the one you have found, preemptible sweeps, is where mark_preempting() comes from. wandb.mark_preempting() will mark a run as preempting, but the run is not requeued until the status is preempted. the status change preempting -> preempted happens when the run exits with a non-zero status (maybe your signal handler is preventing this) or after the run spends 5 minutes in the preempting state and our backend receives no heartbeats from the run.
if the run exits successfully (with zero status) after being put into the preempting state, we assume the run finished successfully before being preempted by the server, and the run state is set to finished. in this case the run will not be requeued.
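The rules in this comment can be summarised as a small decision function. This is only a toy model of the behaviour described here, not W&B's backend code. It assumes that "exits with status" means a non-zero exit status, and the function names, argument names and the use of None for "still running" are made up for illustration.

```python
from typing import Optional

# "5 minutes in the preempting state" with no heartbeats, per the comment above.
PREEMPTING_TIMEOUT_SECONDS = 5 * 60


def state_after_preempting(
    exit_code: Optional[int], seconds_without_heartbeat: float
) -> str:
    """Return the run state that follows 'preempting' under the described rules.

    exit_code is None while the process has not exited yet.
    """
    if exit_code is not None:
        # Zero status: assumed to have finished before being preempted.
        # Non-zero status: treated as preempted.
        return "finished" if exit_code == 0 else "preempted"
    if seconds_without_heartbeat >= PREEMPTING_TIMEOUT_SECONDS:
        # No exit observed, but the backend stopped receiving heartbeats.
        return "preempted"
    return "preempting"


def will_be_requeued(state: str) -> bool:
    """Only runs whose state is 'preempted' are requeued."""
    return state == "preempted"
```

Under this model, a run that is marked as preempting and then exits with status zero ends up in the finished state and is never requeued, which matches the signal-handler caveat mentioned above.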