Regarding submitit checkpointing
See original GitHub issue

Hello,
In run_with_submitit.py, you set an output_dir for each job, so the checkpoint path for a job is [checkpoint_dir]/[job_id]/checkpoint.pth. Now, for example, if the current job 245 has reached the time limit and been killed (while training hasn’t finished yet), the next job 246 will be launched, but it won’t resume from [checkpoint_dir]/245/checkpoint.pth because it will only look for [checkpoint_dir]/246/checkpoint.pth, which doesn’t exist.
Am I correct?
Thanks in advance for your answer!
Issue Analytics
- Created 3 years ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Exactly. However, I just checked the code, and the checkpoint method in DETR actually does not save the current state (in other words, this checkpoint method does not checkpoint 😄); it only requeues a job that starts from the last saved checkpoint:
https://github.com/facebookresearch/detr/blob/699bf53f3e3ecd4f000007b8473eda6a08a8bed6/run_with_submitit.py#L52-L63
I do not know whether checkpointing from the middle of an epoch would be good practice; in any case, that is not for submitit to decide 😉
One thing I forgot for context: by default, submitit does not requeue on timeouts, only on preemptions. It requeues on timeouts only if a checkpoint method is implemented. This avoids requeueing jobs that are bound to time out again (same cause, same consequence). And it only requeues a limited number of times (slurm_max_num_timeouts in the executor init).

There is no need; maybe I should have emphasized that it does not requeue by default, i.e. if the checkpoint method is not implemented. Here the method is implemented, so it should be requeued on both timeouts and preemptions.
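The requeue rules described in these comments can be summarized in a small decision function. The names and structure are illustrative only, not submitit's internals, and they assume (per the first comment above) that preemptions always requeue:

```python
def should_requeue(reason, has_checkpoint_method, num_timeouts, max_num_timeouts):
    """Sketch of the behaviour described above: preemptions requeue;
    timeouts requeue only when a checkpoint method exists, and only
    up to a capped number of attempts."""
    if reason == "preemption":
        return True
    if reason == "timeout":
        return has_checkpoint_method and num_timeouts < max_num_timeouts
    return False  # normal completion or failure: no requeue
```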