
HPC Save Writes Multiple Checkpoints

See original GitHub issue

šŸ› Bug

Currently the hpc_save function (https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L201) doesn’t respect the save_last behavior, under which only one checkpoint should be written. As a result, every time the job is preempted (on SLURM) it writes another checkpoint, which recently caused me to run out of disk space.
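
For context, here is a minimal illustrative sketch (not the library’s actual code) of the two behaviors: the current path derives an incrementing checkpoint number from the files already in the folder, so each preemption adds a file, whereas a save_last-style path would reuse one fixed filename that gets overwritten.

```python
import os

# Illustrative sketch only - not the library's implementation. It shows why disk
# usage grows: an incrementing filename produces a new file on every preemption,
# while a fixed "last"-style filename would simply be overwritten each time.

def versioned_hpc_path(folder: str) -> str:
    """Mimic the incrementing-filename behavior described in the issue."""
    existing = [f for f in os.listdir(folder) if f.startswith("hpc_ckpt_")]
    return os.path.join(folder, f"hpc_ckpt_{len(existing) + 1}.ckpt")

def last_only_hpc_path(folder: str) -> str:
    """A save_last-style alternative: one fixed filename, overwritten on each save."""
    return os.path.join(folder, "hpc_ckpt.ckpt")
```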

To Reproduce

This can’t exactly be reproduced with the requested BoringModel method; it requires a cluster (I know for sure that SLURM reproduces it). Set a short timeout and run: each time the limit is reached, a new checkpoint is written. If this should be controlled separately from the existing save_last flag, another flag could be introduced. This should be an easy fix, and I’d be happy to open a PR if the owners agree with the solution.
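
A minimal reproduction sketch, with assumptions called out: the SLURM directives in the comments (short wall time plus `--signal=SIGUSR1@90`, which the Lightning docs recommend so the auto-resubmit handler fires) and the toy module are placeholders; any LightningModule on a SLURM cluster with a short time limit should show the same accumulation of checkpoints.

```python
# Reproduction sketch (assumption: run under SLURM with a short wall-time limit,
# e.g. submitted with `#SBATCH --time=00:03:00` and `#SBATCH --signal=SIGUSR1@90`
# so Lightning's SLURM auto-resubmit handler fires before the job is killed).
# ToyModel is just a stand-in for the usual BoringModel.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_loader = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

trainer = pl.Trainer(
    max_epochs=10_000,  # long enough that the SLURM time limit is hit first
    callbacks=[pl.callbacks.ModelCheckpoint(dirpath="checkpoints/", save_last=True)],
)
trainer.fit(ToyModel(), train_loader)

# After each preemption/requeue cycle, the checkpoint directory gains another
# numbered hpc checkpoint file instead of a single file being overwritten.
```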

Expected behavior

Only one checkpoint is written.

Environment

  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.2.0
    - tqdm: 4.57.0
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.8.1
    - version: #1 SMP Thu Jan 21 16:15:07 EST 2021

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 15 (12 by maintainers)

Top GitHub Comments

1 reaction
carmocca commented, Mar 1, 2021

It seems to me like all this logic should be integrated into ModelCheckpoint, so that when training is over and on_train_end is called, ModelCheckpoint is the one that does all of this.

Like, why do we call hpc_save manually here? https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L35
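
A rough sketch of that direction (hypothetical, not an actual Lightning API change): the SLURM signal handler could route through the Trainer’s existing checkpointing path, which already understands save_last, instead of calling hpc_save with a new numbered filename. The handle_preemption name and hpc_last.ckpt filename below are made up for illustration.

```python
import os

# Hypothetical sketch of the suggestion above - not Lightning's actual code.
# On preemption, save one fixed-name checkpoint via the Trainer's normal
# checkpointing path so repeated preemptions overwrite the same file.
def handle_preemption(trainer):
    ckpt_path = os.path.join(trainer.default_root_dir, "hpc_last.ckpt")
    trainer.save_checkpoint(ckpt_path)  # Trainer.save_checkpoint is an existing PL API
    # ... then requeue the SLURM job, as the current slurm_connector handler does
```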

0 reactions
stale[bot] commented, Jul 31, 2021

This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
