Regarding submitit checkpointing
See original GitHub issue

Hello,
In run_with_submitit.py, you set an output_dir for each job, so the checkpoint path for a job is [checkpoint_dir]/[job_id]/checkpoint.pth. Now, for example, if the current job 245 has reached the time limit and been killed (while training hasn’t finished yet), the next job 246 will be launched, but it won’t resume from [checkpoint_dir]/245/checkpoint.pth because it will only look for [checkpoint_dir]/246/checkpoint.pth, which doesn’t exist.
Am I correct?
Thanks in advance for your answer!
Issue Analytics
- Created 3 years ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Exactly. However, I just checked the code, and the checkpoint method in DETR actually does not save the current state (in other words, this checkpoint method does not checkpoint 😄); it only requeues a job that starts from the last saved checkpoint:
https://github.com/facebookresearch/detr/blob/699bf53f3e3ecd4f000007b8473eda6a08a8bed6/run_with_submitit.py#L52-L63
I do not know whether checkpointing from the middle of an epoch would be good practice; in any case, that is not for submitit to decide 😉
One thing I forgot for context: by default, submitit does not requeue on timeouts, only on preemptions. It requeues on timeouts only if a checkpoint method is implemented. This avoids requeueing jobs that are bound to time out again (same cause, same consequence). And it only requeues a limited number of times (slurm_max_num_timeouts in the executor init).

There is no need; maybe I should have emphasized that it does not requeue by default, i.e. if the checkpoint method is not implemented. Here the method is implemented, so it should be requeued on both timeouts and preemptions.
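The requeue rules described in these comments can be summarized in a small decision function. The names and structure are illustrative only, not submitit's internals, and they assume (per the first comment above) that preemptions always requeue:

```python
def should_requeue(reason, has_checkpoint_method, num_timeouts, max_num_timeouts):
    """Sketch of the behaviour described above: preemptions requeue;
    timeouts requeue only when a checkpoint method exists, and only
    up to a capped number of attempts."""
    if reason == "preemption":
        return True
    if reason == "timeout":
        return has_checkpoint_method and num_timeouts < max_num_timeouts
    return False  # normal completion or failure: no requeue
```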