[Feature] Resume sweeps

Is your feature request related to a problem? Please describe.

In the documentation on resuming runs, there is a note that says:

Note that resuming a run which was executed as part of a Sweep is not supported.

I often want to resume runs that I executed during a sweep – for example, to train for longer if they haven’t finished converging.

Describe the solution you’d like

I’d love it if resuming sweeps were supported!

Describe alternatives you’ve considered

The alternatives I see are all undesirable – loading the run from a saved checkpoint and training in a new run (loses connection to all the old plots), or training from scratch with more epochs (waste of time and compute).

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 13 (4 by maintainers)

Top GitHub Comments

1 reaction
vanpelt commented, Jan 11, 2022

@GeoffNN there are likely 2 ways to fix this:

  1. Call wandb.finish(exit_code=1) after you mark the run as preempting
  2. Make your process exit with a non-zero status, e.g. exit(1)
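
As a minimal sketch of how the two options above might be combined in a signal handler: the helper names `make_preemption_handler` and `install_preemption_handler` are hypothetical, and the choice of SIGTERM as the preemption signal is an assumption about the cluster, not something stated in this thread. The `wandb` module is passed in as an argument so the handler can be exercised without a live run.

```python
import signal
import sys


def make_preemption_handler(wandb_module):
    """Build a signal handler that marks the run as preempting, then exits non-zero.

    `wandb_module` is expected to expose mark_preempting() and
    finish(exit_code=...), as the wandb package does once a run is active.
    """
    def handler(signum, frame):
        # Tell the backend this run is being preempted.
        wandb_module.mark_preempting()
        # Option 1 from the comment above: finish the run with a non-zero exit code.
        wandb_module.finish(exit_code=1)
        # Option 2 from the comment above: make the process itself exit non-zero.
        sys.exit(1)

    return handler


def install_preemption_handler(wandb_module, signum=signal.SIGTERM):
    """Register the handler; SIGTERM is an assumed default, adjust for your scheduler."""
    signal.signal(signum, make_preemption_handler(wandb_module))
```

In a training script this would be called as `install_preemption_handler(wandb)` after `wandb.init()`. Whether it is safe to call `wandb.finish` from inside a signal handler in your setup is worth checking before relying on it.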
1 reaction
dannygoldstein commented, Jan 11, 2022

hi @GeoffNN, this is not actually the feature I was talking about. we have another one coming out soon called rewind (it has been delayed a bit) that should make addressing this issue easier.

the one you have found, preemptible sweeps, is where mark_preempting() comes from.

wandb.mark_preempting() will mark a run as preempting, but the run is not requeued until its status is preempted. the status change preempting -> preempted happens either when the run exits with a non-zero status (maybe your signal handler is preventing this), or after the run spends 5 minutes in the preempting state and our backend receives no heartbeats from the run.

if the run exits successfully (with zero status) after being put into the preempting state we assume the run finished successfully before being preempted by the server and the run state is set to finished. in this case the run will not be requeued.
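
The transitions described above can be summarized in a small model. This is only an illustrative sketch of the rules as stated in this comment, not W&B's backend code: the function names, the state strings, and the `silent_seconds` parameter (time spent in the preempting state with no heartbeats) are all hypothetical, and it assumes that a non-zero exit is what moves a run to preempted.

```python
from typing import Optional

# "5 minutes" as stated in the comment above.
PREEMPTING_TIMEOUT_S = 5 * 60


def resolve_preempting(exit_code: Optional[int], silent_seconds: float) -> str:
    """Return the state a run in the preempting state ends up in.

    exit_code: process exit status, or None if the process has not exited.
    silent_seconds: seconds spent in preempting with no heartbeats received.
    """
    if exit_code is not None:
        # Zero exit: assumed to have finished before the preemption took effect.
        # Non-zero exit: treated as preempted.
        return "finished" if exit_code == 0 else "preempted"
    if silent_seconds >= PREEMPTING_TIMEOUT_S:
        # No exit observed, but the run has gone quiet for long enough.
        return "preempted"
    return "preempting"


def is_requeued(state: str) -> bool:
    """Only runs that reach the preempted state are requeued."""
    return state == "preempted"
```

Under this model, a handler that catches the preemption signal and then lets the process exit with status 0 leaves the run in the finished state, so it is never requeued.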
