[Feature] Resume sweeps

Is your feature request related to a problem? Please describe.

In the documentation on resuming runs, there is a note that says:

Note that resuming a run which was executed as part of a Sweep is not supported.

I often want to resume runs that I executed during a sweep – for example, to train for longer if they haven’t finished converging.

Describe the solution you’d like

I’d love it if resuming runs that were executed as part of a sweep was supported!

Describe alternatives you’ve considered

The alternatives I see are all undesirable: loading the run from a saved checkpoint and training in a new run (which loses the connection to all the old plots), or training from scratch with more epochs (a waste of time and compute).

Issue Analytics

Created: 2 years ago
Reactions: 1
Comments: 13 (4 by maintainers)

Top GitHub Comments

1 reaction

vanpelt commented, Jan 11, 2022

@GeoffNN there are likely 2 ways to fix this:

1. Call wandb.finish(exit_code=1) after you mark the run as preempting.
2. Make your process exit with a non-zero status, e.g. exit(1).
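
The two options above can be sketched roughly as follows. This is a minimal sketch, not the maintainers' code: `exit_as_preempted` and its `wb` and `use_finish` parameters are hypothetical names, and `wb` stands for the imported `wandb` module, passed in explicitly so the function can be tried with a stub. It assumes a run is already active (that is, `wandb.init()` has been called).

```python
import sys


def exit_as_preempted(wb, use_finish=True):
    """Mark the active run as preempting, then end it with a non-zero status.

    `wb` is expected to be the imported `wandb` module (hypothetical parameter,
    passed in only so the sketch can be exercised with a stub).
    """
    # Tell the backend that this run is about to be preempted.
    wb.mark_preempting()
    if use_finish:
        # Option 1: finish the run explicitly with a non-zero exit code.
        wb.finish(exit_code=1)
    else:
        # Option 2: let the whole process exit with a non-zero status.
        sys.exit(1)
```

In a training script this would be called as `exit_as_preempted(wandb)` at the point where the run stops early and should be picked up again later.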

1 reaction

dannygoldstein commented, Jan 11, 2022

hi @GeoffNN, this is not actually the feature I was talking about. we have another one coming out soon called rewind (it has been delayed a bit) that should make addressing this issue easier.

the one you have found, preemptible sweeps, is where mark_preempting() comes from.

wandb.mark_preempting() will mark a run as preempting, but the run is not requeued until its status is preempted. the status change preempting -> preempted happens either when the run exits with a non-zero status (maybe your signal handler is preventing this) or after the run spends 5 minutes in the preempting state and our backend receives no heartbeats from it.

if the run exits successfully (with zero status) after being put into the preempting state we assume the run finished successfully before being preempted by the server and the run state is set to finished. in this case the run will not be requeued.
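
The transitions described in this comment can be summarized as a small toy model. This is only an illustration of the rules as stated here, not W&B's backend logic: the function `next_state`, its parameters, and the constant name are hypothetical, and it assumes that the exit status which triggers preempted is a non-zero one.

```python
# Hypothetical constant: the 5-minute window mentioned in the comment.
PREEMPTING_TIMEOUT_S = 5 * 60


def next_state(exit_code=None, silent_seconds=0.0):
    """Toy model: where a run in the 'preempting' state ends up.

    exit_code: None while the process is still alive, else its exit status.
    silent_seconds: time spent in 'preempting' without any heartbeat.
    """
    if exit_code is not None:
        # A zero exit is treated as a normal finish, so no requeue;
        # a non-zero exit is assumed to mean the run was preempted.
        return "finished" if exit_code == 0 else "preempted"
    if silent_seconds >= PREEMPTING_TIMEOUT_S:
        # No heartbeats for the whole window: treated as preempted.
        return "preempted"
    # Otherwise the run stays in 'preempting' and is not requeued yet.
    return "preempting"
```

Under this model, only the "preempted" outcome leads to the run being requeued, which is why a signal handler that swallows the signal and lets the script exit with status 0 would leave the run marked as finished.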