Regarding submitit checkpointing

Hello,

In run_with_submitit.py, you set an output_dir for each job, thus the checkpoint path for a job is [checkpoint_dir]/[job_id]/checkpoint.pth. Now for example, if the current job 245 has reached the time limit and has been killed (while training hasn’t finished yet), the next job 246 will be launched, but it won’t resume from [checkpoint_dir]/245/checkpoint.pth because it will only look for [checkpoint_dir]/246/checkpoint.pth, which doesn’t exist.
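The lookup described above can be sketched in a few lines. This is only an illustration of the scenario in the question, not code from run_with_submitit.py: the helper names `checkpoint_path` and `find_resume_file` are hypothetical, and the sketch does not settle whether a requeued job really receives a new job ID.

```python
import tempfile
from pathlib import Path
from typing import Optional


def checkpoint_path(checkpoint_dir: Path, job_id: str) -> Path:
    """Per-job checkpoint location, as described in the question (hypothetical helper)."""
    return checkpoint_dir / job_id / "checkpoint.pth"


def find_resume_file(checkpoint_dir: Path, job_id: str) -> Optional[Path]:
    """Return this job's checkpoint if it exists, otherwise None."""
    path = checkpoint_path(checkpoint_dir, job_id)
    return path if path.exists() else None


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Job 245 wrote a checkpoint before it was killed.
        saved = checkpoint_path(root, "245")
        saved.parent.mkdir(parents=True)
        saved.write_bytes(b"fake weights")
        # A job with a different ID looks only in its own folder.
        found_245 = find_resume_file(root, "245") is not None
        found_246 = find_resume_file(root, "246")
        return found_245, found_246
```

If the follow-up job did run under ID 246, `find_resume_file` would return `None` for it even though job 245's file is on disk, which is exactly the concern raised here.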

Am I correct?

Thanks in advance for your answer!

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

1 reaction
jrapin commented, Sep 25, 2020

> Thanks. According to the documentation, it seems that the checkpoint() function is called whenever Slurm sends a timeout signal. This means that the waste of computation that I described in the previous message cannot happen: even if the time limit is reached in the middle of a training, the current checkpoint is saved and will be resumed after the job has been requeued.

Exactly. However, I just checked the code, and the checkpoint method in detr does not actually save the current state (in other words, this checkpoint method does not checkpoint 😄). It only requeues a job that starts from the last saved checkpoint:

https://github.com/facebookresearch/detr/blob/699bf53f3e3ecd4f000007b8473eda6a08a8bed6/run_with_submitit.py#L52-L63
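As a rough illustration of that behavior (not the code behind the link), here is a minimal sketch of a checkpoint method that writes nothing and only hands back a fresh callable pointed at the last file already on disk. The `Requeue` class is a hypothetical stand-in for whatever object the real method returns, and the `Trainer` fields are simplified assumptions.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Requeue:
    """Hypothetical stand-in for the object a checkpoint method returns."""
    trainer: "Trainer"


class Trainer:
    def __init__(self, output_dir: Path, resume: Optional[Path] = None):
        self.output_dir = output_dir
        self.resume = resume

    def checkpoint(self) -> Requeue:
        # Nothing is written here: progress since the last save is lost.
        last_saved = self.output_dir / "checkpoint.pth"
        resume = last_saved if last_saved.exists() else self.resume
        # Hand back a fresh trainer that restarts from the last saved file.
        return Requeue(Trainer(self.output_dir, resume))


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # No file yet: the requeued trainer has nothing to resume from.
        before = Trainer(out).checkpoint().trainer.resume
        (out / "checkpoint.pth").write_bytes(b"epoch 3")
        files_before = sorted(p.name for p in out.iterdir())
        after = Trainer(out).checkpoint().trainer.resume
        files_after = sorted(p.name for p in out.iterdir())
        return before, after.name, files_before == files_after
```

The last element of `demo()` checks that calling `checkpoint()` leaves the output folder untouched, which is the sense in which such a method "does not checkpoint".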

I do not know whether checkpointing from the middle of an epoch would be good practice; in any case, that is not for submitit to decide 😉

I forgot one thing for context: by default, submitit does not requeue on timeouts, only on preemptions. It requeues on timeouts only if there is a checkpoint method. This avoids requeueing jobs that are bound to time out again (same cause, same consequence). Even then, it requeues only a limited number of times (slurm_max_num_timeout in the executor init).
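The rule in this comment can be written down as a small decision function. This is a sketch of the policy as stated here, not submitit's implementation: the function name, its arguments, and the reading that the cap counts timeouts only are all assumptions.

```python
def should_requeue(reason: str, has_checkpoint_method: bool,
                   timeouts_so_far: int, max_num_timeout: int) -> bool:
    """Decide whether a stopped job goes back in the queue (hypothetical helper)."""
    if reason == "preemption":
        # Preemptions are requeued by default.
        return True
    if reason == "timeout":
        # Timeouts are requeued only when a checkpoint method exists,
        # and only until the cap is reached (assumed to count timeouts).
        return has_checkpoint_method and timeouts_so_far < max_num_timeout
    # Any other reason (e.g. a crash) is not covered by the comment.
    return False
```

For example, `should_requeue("timeout", False, 0, 3)` is `False`, since a job with no checkpoint method would just time out again, while `should_requeue("timeout", True, 0, 3)` is `True`.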

0 reactions
jrapin commented, Oct 13, 2020

> Maybe one can somehow fake a preemption signal from Python to force a re-queue

There is no need. Maybe I should have emphasized that it does not requeue by default, i.e. if the checkpoint method is not implemented. Here the method is implemented, so the job should be requeued on both timeouts and preemptions.
