HPC Save Writes Multiple Checkpoints
🐛 Bug
Currently the hpc_save function (https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L201) doesn't respect the save_last behavior, i.e. that only one checkpoint should be written. As a result, every time the job is preempted (on SLURM) it writes another checkpoint, which recently caused me to run out of disk space.
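For illustration, here is a minimal sketch of the behavior I would expect, not the actual Lightning implementation; save_hpc_checkpoint, keep_only_latest, and the hpc_ckpt_* filenames are made-up stand-ins. The idea is simply to reuse one fixed filename when only the latest checkpoint should be kept, instead of writing a new numbered file on every preemption.

```python
import os

import torch


def save_hpc_checkpoint(checkpoint: dict, folderpath: str, keep_only_latest: bool) -> str:
    """Illustrative sketch only -- not the actual Lightning code.

    If ``keep_only_latest`` is True, the same file is overwritten on every
    preemption, so disk usage stays constant. Otherwise a new numbered file
    is written each time, which is the behavior described in this issue.
    """
    os.makedirs(folderpath, exist_ok=True)

    if keep_only_latest:
        # Reuse one fixed filename so repeated SLURM requeues do not pile up files.
        filepath = os.path.join(folderpath, "hpc_ckpt_last.ckpt")
    else:
        # Mimic the "write yet another checkpoint" behavior: find the highest
        # existing numeric suffix and add one.
        suffixes = []
        for name in os.listdir(folderpath):
            if name.startswith("hpc_ckpt_") and name.endswith(".ckpt"):
                stem = name[len("hpc_ckpt_"):-len(".ckpt")]
                if stem.isdigit():
                    suffixes.append(int(stem))
        next_idx = max(suffixes, default=0) + 1
        filepath = os.path.join(folderpath, f"hpc_ckpt_{next_idx}.ckpt")

    torch.save(checkpoint, filepath)
    return filepath
```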
To Reproduce
This can't exactly be reproduced with the requested BoringModel method; it requires a cluster (I know for sure SLURM will repro this). Set a short timeout and run: each time the limit is reached, a new checkpoint is written. If this should be controlled separately from the existing save_last flag, then another flag should be introduced. This should be an easy fix, and I'd be happy to PR it if the owners are agreeable to the solution.
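For concreteness, a hypothetical minimal script along these lines (TinyModel is a stand-in, not the BoringModel) can be submitted with sbatch using a very short wall-time limit, e.g. --time=00:05:00, together with the --signal=SIGUSR1@90 and --requeue options the Lightning SLURM docs suggest for auto-resubmission (double-check those flags for your cluster; they are my assumption of a typical setup). After a few requeues the checkpoint directory keeps growing.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import LightningModule, Trainer


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def main():
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    # Deliberately long-running so the SLURM wall-time limit is hit mid-training.
    trainer = Trainer(max_epochs=10_000, default_root_dir="./repro_ckpts")
    trainer.fit(TinyModel(), data)
    # After each preemption/requeue, ./repro_ckpts accumulates another HPC checkpoint.


if __name__ == "__main__":
    main()
```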
Expected behavior
Only one checkpoint is written.
Environment
- CUDA:
  - GPU:
  - available: False
  - version: 10.2
- Packages:
  - numpy: 1.20.1
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1
  - pytorch-lightning: 1.2.0
  - tqdm: 4.57.0
- System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.8.1
  - version: #1 SMP Thu Jan 21 16:15:07 EST 2021
Issue Analytics
- Created 3 years ago
- Comments: 15 (12 by maintainers)
It seems to me like all this logic should be integrated into ModelCheckpoint, so that when training is over and on_train_end is called, ModelCheckpoint is the one that does all of this. Like, why do we call hpc_save manually here? https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L35

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
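To make the earlier suggestion concrete, here is a rough user-land sketch of what letting a callback own this could look like. It is illustrative only: SingleHPCCheckpoint and the hpc_last.ckpt filename are made up, and whether on_train_end actually fires on SLURM preemption in this version is exactly the open question; only Callback and Trainer.save_checkpoint are existing Lightning API.

```python
import os

from pytorch_lightning.callbacks import Callback


class SingleHPCCheckpoint(Callback):
    """Write (and overwrite) exactly one checkpoint when training ends."""

    def __init__(self, dirpath: str):
        os.makedirs(dirpath, exist_ok=True)
        # Fixed filename: repeated preemptions/requeues reuse the same file.
        self.filepath = os.path.join(dirpath, "hpc_last.ckpt")

    def on_train_end(self, trainer, pl_module) -> None:
        # Trainer.save_checkpoint is public API; overwriting one path keeps
        # disk usage constant instead of accumulating HPC checkpoints.
        trainer.save_checkpoint(self.filepath)
```

It would be attached like any other callback, e.g. Trainer(callbacks=[SingleHPCCheckpoint("./checkpoints")]).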