
HPC Save Writes Multiple Checkpoints

See original GitHub issue

šŸ› Bug

Currently the hpc_save function (https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L201) doesn’t respect the save_last behavior, under which only one checkpoint should be written. As a result, every time the job is preempted (on SLURM) it writes another checkpoint, which recently caused me to run out of disk space.
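
For context, here is a minimal illustrative sketch (not the library’s actual code) of the two behaviors: the current path derives an incrementing checkpoint number from the files already in the folder, so each preemption adds a file, whereas a save_last-style path would reuse one fixed filename that gets overwritten.

```python
import os

# Illustrative sketch only - not the library's implementation. It shows why disk
# usage grows: an incrementing filename produces a new file on every preemption,
# while a fixed "last"-style filename would simply be overwritten each time.

def versioned_hpc_path(folder: str) -> str:
    """Mimic the incrementing-filename behavior described in the issue."""
    existing = [f for f in os.listdir(folder) if f.startswith("hpc_ckpt_")]
    return os.path.join(folder, f"hpc_ckpt_{len(existing) + 1}.ckpt")

def last_only_hpc_path(folder: str) -> str:
    """A save_last-style alternative: one fixed filename, overwritten on each save."""
    return os.path.join(folder, "hpc_ckpt.ckpt")
```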

To Reproduce

This can’t exactly be reproduced with the requested BoringModel method; it requires a cluster (I know for sure that SLURM reproduces it). Set a short timeout and run: each time the limit is reached, a new checkpoint is written. If this should be controlled separately from the existing save_last flag, another flag could be introduced. This should be an easy fix, and I’d be happy to open a PR if the owners agree with the solution.
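
A minimal reproduction sketch, with assumptions called out: the SLURM directives in the comments (short wall time plus `--signal=SIGUSR1@90`, which the Lightning docs recommend so the auto-resubmit handler fires) and the toy module are placeholders; any LightningModule on a SLURM cluster with a short time limit should show the same accumulation of checkpoints.

```python
# Reproduction sketch (assumption: run under SLURM with a short wall-time limit,
# e.g. submitted with `#SBATCH --time=00:03:00` and `#SBATCH --signal=SIGUSR1@90`
# so Lightning's SLURM auto-resubmit handler fires before the job is killed).
# ToyModel is just a stand-in for the usual BoringModel.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_loader = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

trainer = pl.Trainer(
    max_epochs=10_000,  # long enough that the SLURM time limit is hit first
    callbacks=[pl.callbacks.ModelCheckpoint(dirpath="checkpoints/", save_last=True)],
)
trainer.fit(ToyModel(), train_loader)

# After each preemption/requeue cycle, the checkpoint directory gains another
# numbered hpc checkpoint file instead of a single file being overwritten.
```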

Expected behavior

Only one checkpoint is written.

Environment

  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.2.0
    - tqdm: 4.57.0
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.8.1
    - version: #1 SMP Thu Jan 21 16:15:07 EST 2021

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 15 (12 by maintainers)

Top GitHub Comments

1 reaction
carmocca commented, Mar 1, 2021

It seems to me like all this logic should be integrated into ModelCheckpoint, so that when training is over and on_train_end is called, ModelCheckpoint is the one that does all of this.

Like, why do we call hpc_save manually here? https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L35
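
A rough sketch of that direction (hypothetical, not an actual Lightning API change): the SLURM signal handler could route through the Trainer’s existing checkpointing path, which already understands save_last, instead of calling hpc_save with a new numbered filename. The handle_preemption name and hpc_last.ckpt filename below are made up for illustration.

```python
import os

# Hypothetical sketch of the suggestion above - not Lightning's actual code.
# On preemption, save one fixed-name checkpoint via the Trainer's normal
# checkpointing path so repeated preemptions overwrite the same file.
def handle_preemption(trainer):
    ckpt_path = os.path.join(trainer.default_root_dir, "hpc_last.ckpt")
    trainer.save_checkpoint(ckpt_path)  # Trainer.save_checkpoint is an existing PL API
    # ... then requeue the SLURM job, as the current slurm_connector handler does
```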

0 reactions
stale[bot] commented, Jul 31, 2021

This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
